Genuine Blog

An outlet of thoughts and insights on scalable contemporary software technology for distributed computing platforms, with proof-of-concept applications using Scala, Akka, Apache Spark, Kafka, SQL/NoSQL, and AWS.

Yet Another Blogger

I read tech blogs from time to time but have never published one myself. So, why now? Short answer: feeling a bit obligated …

Despite a much-needed reality check in 2000/2001 that resulted in a rather painful setback, the technology industry has managed to prosper over the past two decades. Thanks to a worldwide community's relentless pursuit of ever-better software technologies in the form of open-source software and public knowledge bases, we all benefit tremendously from an abundant supply of tools and resources. Open-source software was mostly toy-ish or experimental stuff just 10-15 years ago; it now serves as the underlying technology foundation for many businesses. The vast knowledge base in the form of wikis, forums, blogs, and Q&A sites has also made learning and solution-seeking so much easier nowadays. It would have been beyond joyous had Wikipedia and Stack Overflow been available back then.

Throughout my 20-year career in both software development and engineering management roles, I've learned a lot from my jobs and my colleagues, but even more from the said technology community. It has almost become an obligation that I try to pay back to the community, in however small a way. Besides some random thoughts on the state of the evolving technology industry and interesting things observed in my career, I'm hoping to also produce material at the code level that is applicable to each blog topic.

Since departing my most recent cleantech startup venture (EcoFactor), where I served as the founding technology head, I've been working on some R&D projects of my own and doing a bit more coding myself – which is usually fun, except when wrestling with some stubborn bugs. The slightly heavier coding work is indeed a refreshing exercise and will better equip me to give topical code examples, should the need arise.


Programming Exercise – Sorting Algorithm

Unless algorithm development is part of the job, many software engineers use readily available algorithmic methods as needed and rarely develop algorithms themselves. I'm not talking about complicated statistical or machine learning algorithms – just simple, mundane ones such as sorting. Even if you don't need to code algorithms at work, going back to writing a sorting program can still be a good exercise to review basic skills that might not be frequently used in your current day job. It's a good-sized programming exercise that isn't too trivial and doesn't take up too much time. It also reminds you of some clever (or dumb) tricks for performing sorting by means of recursive divide-and-merge, pivot partitioning, etc. And if nothing else, it might help you in your next technical job interview.

If you're up for such an exercise, first look up a sorting algorithm of your choice (e.g. Merge Sort, Quick Sort) on Wikipedia or any other suitable source to re-familiarize yourself with its underlying mechanism. Next, decide on the scope of the application – for example, do you want an integer-sorting application or something more generic? Then write some pseudocode, pick the programming language of your choice, and go for it.

Appended is a simple implementation of both Merge Sort and Quick Sort in Java. For the convenience of making method calls with varying data types and sorting algorithms, an interface (SimpleSort) and a wrapper (Sorter) are used. Java generics are used to generalize the sorter for different data types. Adding generics to a sorting application requires using either the Comparable or Comparator interface, as ordering is necessary in sorting. In this example application, the Comparable interface is used since the default ordering is good enough for basic sorting. The overall implementation code is pretty self-explanatory.

SimpleSort.java

Sorter.java

MergeSort.java

QuickSort.java

SortingMain.java
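For illustration, a minimal sketch of such an implementation might look like the following. The class names follow the files listed above, but the bodies are a reconstruction, not the original code: a SimpleSort interface bounded by Comparable, a top-down MergeSort, a Lomuto-partition QuickSort, and a thin Sorter wrapper.

```java
import java.util.Arrays;

// A common interface so the wrapper can swap algorithms freely.
interface SimpleSort<T extends Comparable<T>> {
    void sort(T[] items);   // sorts in place, ascending by natural ordering
}

class MergeSort<T extends Comparable<T>> implements SimpleSort<T> {
    public void sort(T[] a) {
        if (a.length < 2) return;
        T[] left = Arrays.copyOfRange(a, 0, a.length / 2);   // divide
        T[] right = Arrays.copyOfRange(a, a.length / 2, a.length);
        sort(left);
        sort(right);
        int i = 0, j = 0, k = 0;                             // merge back into a
        while (i < left.length && j < right.length)
            a[k++] = (left[i].compareTo(right[j]) <= 0) ? left[i++] : right[j++];
        while (i < left.length) a[k++] = left[i++];
        while (j < right.length) a[k++] = right[j++];
    }
}

class QuickSort<T extends Comparable<T>> implements SimpleSort<T> {
    public void sort(T[] a) { quicksort(a, 0, a.length - 1); }
    private void quicksort(T[] a, int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);
        quicksort(a, lo, p - 1);
        quicksort(a, p + 1, hi);
    }
    private int partition(T[] a, int lo, int hi) {
        T pivot = a[hi];                 // last element as the pivot
        int i = lo - 1;
        for (int j = lo; j < hi; j++)
            if (a[j].compareTo(pivot) <= 0) swap(a, ++i, j);
        swap(a, i + 1, hi);
        return i + 1;
    }
    private void swap(T[] a, int i, int j) { T t = a[i]; a[i] = a[j]; a[j] = t; }
}

// The wrapper: pick an algorithm at construction time, reuse the same call site.
class Sorter<T extends Comparable<T>> {
    private final SimpleSort<T> algorithm;
    Sorter(SimpleSort<T> algorithm) { this.algorithm = algorithm; }
    void sort(T[] items) { algorithm.sort(items); }
}
```

With this shape, switching from `new Sorter<>(new MergeSort<>())` to `new Sorter<>(new QuickSort<>())` changes the algorithm without touching the calling code, which is the point of routing everything through the interface.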


Real-time Big Data

Although demand for large-scale distributed computing solutions has existed for decades, the term Big Data did not get much public attention until Google published its data processing programming model, MapReduce, back in 2004. The Java-based Hadoop framework further popularized the term a couple of years later, partly due to the ubiquitous popularity of Java.

From Batch to Real-time

Hadoop has proven itself a great platform for running MapReduce applications on fault-tolerant HDFS clusters, typically composed of inexpensive servers. It does very well in the large-scale batch data processing problem domain. Adding a distributed database system such as HBase or Cassandra helps extend the platform to address the need for real-time (or near-real-time) access to structured data. But in order to use feature-rich messaging or streaming functionality, one needs suitable systems that operate well on a distributed platform.

I remember feeling the need for such a real-time Big Data system a couple of years ago when I was with a cleantech startup, EcoFactor, seeking solutions to handle increasingly demanding time-series data processing in a near-real-time fashion. It would have saved me and my team a lot of internal development work had such a system been available. One of my recent R&D projects after I left the company was to adopt suitable software components to address such real-time distributed data processing needs. The following highlights the key pieces I picked and what prompted me to pick them for the task.

Over the past couple of years, Kafka and Storm have emerged as two maturing distributed data processing systems. Kafka was developed by LinkedIn for high-performance messaging, whereas Storm, developed by Twitter (through the acquisition of BackType), addresses the real-time streaming problem space. Although the two technologies were independently built, some real-time Big Data solution seekers see advantages in bringing the two together, integrating Kafka as the source queue with a Storm streaming system. According to a blog post published earlier this year, Twitter also uses the Kafka-Storm combo to handle its demanding real-time search.

High-performance Distributed Messaging

Kafka is a distributed publish-subscribe messaging system equipped with robust semantic partitioning and high messaging throughput. High performance is evidently a key objective in Kafka's underlying architecture: it leverages kernel-managed page caching to minimize data copying and context switching, and it uses message grouping (MessageSet) to reduce network calls. At-least-once message processing is guaranteed. If exactly-once is a business requirement, one approach is to programmatically keep track of the messaging state, coupling the data with the message offset to eliminate duplication.
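The offset-coupling idea can be sketched independently of the Kafka API. Assuming each consumed message carries the offset of its partition (the class and method names below are hypothetical), the consumer persists its computed state together with the offset of the last message applied, and treats any message at or below that offset as a replay:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of exactly-once semantics on top of at-least-once delivery:
// the computed state is stored together with the offset of the last message
// applied, so a replayed (duplicate) message is detected and skipped.
class DedupingCounter {
    // In a real system this map would live in durable storage and be
    // updated atomically with the business state.
    private final Map<String, Long> lastAppliedOffset = new HashMap<>();
    private long count = 0;

    /** Applies a message once; returns false if it is a replay. */
    boolean apply(String partition, long offset) {
        Long last = lastAppliedOffset.get(partition);
        if (last != null && offset <= last) {
            return false;                         // already applied; skip
        }
        count++;                                  // the actual business update
        lastAppliedOffset.put(partition, offset); // persisted with the state
        return true;
    }

    long count() { return count; }
}
```

The key design point is that the state update and the offset update must be committed together; if they are stored separately, a crash between the two writes reintroduces the duplicate-or-loss problem.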

Kafka flows data by having publishers push data to the brokers (Kafka servers) and subscribers pull from the brokers, giving the flexibility of a more diverse set of message consumers in the messaging system. It uses ZooKeeper for automatic broker discovery in non-static broker configurations, and also to maintain message topics and partitions. Messages can be programmatically partitioned over a server cluster and consumed with ordering preserved within individual partitions. There are two consumer APIs: a high-level consumer (ConsumerConnector) that heavily leverages ZooKeeper to handle broker discovery, consumer rebalancing, and message offset tracking; and a low-level consumer (SimpleConsumer) that lets users manually customize all the key messaging features.

Setting up Kafka on a cluster is pretty straightforward. The version used in my project is 0.7.2. Each Kafka server is identified by a unique broker id. Each broker comes with tunable configurations for the socket server, logs, and connections to a ZooKeeper ensemble (if enabled). There are also a handful of configurable properties that are producer-specific (e.g. producer.type, serializer.class, partitioner.class) and consumer-specific (e.g. autocommit.interval.ms).
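For orientation, a broker's configuration might look like the following sketch. The property names are from the 0.7-era server.properties as I recall them and the values are purely illustrative; consult the sample configuration bundled with the release for the authoritative list.

```properties
# Illustrative Kafka 0.7-era broker settings (values are examples only)
brokerid=1                        # unique id per broker in the cluster
port=9092                         # socket server port
log.dir=/var/kafka/logs           # where message logs are kept
log.retention.hours=168           # how long to keep messages
enable.zookeeper=true             # register with a ZooKeeper ensemble
zk.connect=zk1:2181,zk2:2181,zk3:2181
```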

The following links detail Kafka’s core design principles:

http://kafka.apache.org/07/design.html
http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf

Real-time Distributed Streaming

Storm is a distributed streaming system that streams data through a customizable topology of data sources (spouts) and processors (bolts). It uses ZooKeeper as well, to coordinate among the master and worker nodes in a cluster and to manage transactional state as needed. Streams (composed of tuples), spouts, and bolts constitute the core components of a Storm topology. A set of stream grouping methods is provided for partitioning a stream among consuming bolts in accordance with various use cases. A Storm topology executes across multiple configurable worker processes, each of which is a JVM. A Thrift-based topology builder is used to define and submit a topology.

On reliability, Storm guarantees that every tuple from a spout gets fully processed. It manages the complete lifecycle of each input tuple by tracking its id (msgId) and using the methods ack(Object msgId) and fail(Object msgId) to report processing results. By anchoring the input tuple from the spout to every associated tuple emitted in the consuming bolts, the spout can replay the tuple in the event of failure. This ensures at-least-once message processing.

Transactional Stream Processing

Storm's transactional topology goes a step further to ensure exactly-once message processing. It processes tuples in batches, each identified by a transaction id that is maintained in persistent storage. A transaction is composed of two phases – a processing phase and a commit phase. The processing phase allows batches to proceed in parallel (if the specific business logic warrants it), whereas the commit phase requires batches to be strongly ordered. For a given batch, any failure during the processing or commit phase results in a replay of the entire transaction. To avoid over-updating on a replayed batch, the current transaction id is examined against the stored transaction id within the strongly ordered commit phase, and a persisted update takes place only when the ids differ.
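Stripped of the Storm machinery, the commit-phase idempotency check can be sketched as follows. This is a hypothetical, single-value example; in Storm, the stored value and transaction id would live together in a persistent store, and commits arrive strongly ordered by transaction id:

```java
// Hypothetical sketch of the transactional commit rule: a batch's update is
// persisted only if its transaction id differs from the one stored with the
// value, so replaying a failed batch cannot double-apply the update.
class TransactionalCount {
    private long storedTxId = -1;   // transaction id persisted with the value
    private long value = 0;

    /** Commit phase: strongly ordered by txId; idempotent under replay. */
    void commit(long txId, long batchDelta) {
        if (txId == storedTxId) {
            return;                 // batch already committed; replay is a no-op
        }
        value += batchDelta;        // apply the batch's aggregate update
        storedTxId = txId;          // stored atomically with the value
    }

    long value() { return value; }
}
```

Note that the check works only because commits are strongly ordered: a replayed batch always carries either the stored id (skip) or the next id (apply), never an arbitrary older one.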

Then there is Trident, a high-level abstraction API on top of Storm that helps internalize some of the state management effort. It also introduces opaque transactional spouts to address failure cases in which the loss of partial source data forbids replaying the batch. It achieves such fault tolerance by maintaining in persistent storage a previous-state computed value (e.g. a word count) in addition to the current computed value and transaction id. The idea is to reliably carry partial value changes over across strongly ordered batches, allowing the failed tuples in a partially failed batch to be processed in a subsequent batch.

Deploying Storm on a production cluster requires a little extra effort. The version used in my project, 0.8.1, requires a dated version of ZeroMQ – a native socket/messaging library written in C++ – which in turn needs JZMQ for Java binding. To build ZeroMQ, the UUID library (libuuid-devel) is needed as well. Within the cluster, the master node runs the "Nimbus" daemon and each of the worker nodes runs a "Supervisor" daemon. Storm also comes with an administrative web UI that is handy for status monitoring.

The following links provide details on the topics of Storm’s message reliability, transactional topology and Trident’s essentials:

https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing
https://github.com/nathanmarz/storm/wiki/Transactional-topologies
https://github.com/nathanmarz/storm/wiki/Trident-state

And, 1 + 1 = 1 …

Both Kafka and Storm are promising technologies in the real-time Big Data arena. While they work great individually, their functionalities also complement each other well. A commonly seen use case has Storm as the centerpiece of a streaming system, with Kafka spouts providing the queueing mechanism and data-processing bolts carrying out specific business logic. If persistent storage is needed, which is often the case, one can develop a bolt to persist data into a NoSQL database such as HBase or Cassandra.

These are all exciting technologies and are part of what makes contemporary open-source distributed computing software prosperous. Even though they're promising, that doesn't mean they're suitable for every company that wants to run a real-time Big Data system. At their current level of maturity, adopting them still requires hands-on software technologists to objectively assess needs, design, implement, and come up with an infrastructure support plan.


Streaming Real-time Data Into HBase

Fast write is generally a characteristic strength of distributed NoSQL databases such as HBase and Cassandra. Yet, for a distributed application that needs to capture rapid streams of data in a database, the standard connection pooling provided by the database might not be up to the task. For instance, I didn't get the desired performance when using HBase's HTablePool to accommodate real-time streaming of data from a high-parallelism data-dumping Storm bolt.

To dump rapid real-time streaming data into HBase, instead of HTablePool it might be more efficient to embed a queueing mechanism in the HBase storage module. An ex-colleague of mine, the architect at a VoIP service provider, employs this very mechanism in their production HBase database. Below is a simple implementation that has been tested to perform well with a good-sized Storm topology. The code is rather self-explanatory. The HBaseStreamers class consists of a threaded inner class, Streamer, which maintains a queue of HBase Puts using a LinkedBlockingQueue. Key parameters are in the HBaseStreamers constructor argument list, including the ZooKeeper quorum, HBase table name, HTable auto-flush switch, number of streaming queues, and streaming queue capacity.
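The queueing idea can be sketched in a self-contained form with the HBase specifics abstracted behind a generic batch sink. All names here are hypothetical and this is not the original HBaseStreamers listing; with real HBase, the sink would batch the records into `Put`s and flush them via the table client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical, HBase-free sketch of the described streamer: writers enqueue
// records into bounded queues, and threaded streamers drain them to a sink
// in batches, cushioning bursts from high-parallelism writers.
class Streamers<T> {
    private final List<LinkedBlockingQueue<T>> queues = new ArrayList<>();
    private final List<Thread> workers = new ArrayList<>();
    private volatile boolean running = true;

    Streamers(int numQueues, int capacity, int batchSize, Consumer<List<T>> sink) {
        for (int q = 0; q < numQueues; q++) {
            LinkedBlockingQueue<T> queue = new LinkedBlockingQueue<>(capacity);
            queues.add(queue);
            Thread t = new Thread(() -> {            // the threaded "Streamer"
                List<T> batch = new ArrayList<>(batchSize);
                while (running || !queue.isEmpty()) {
                    batch.clear();
                    queue.drainTo(batch, batchSize); // drain up to one batch
                    if (batch.isEmpty()) {
                        try { Thread.sleep(10); } catch (InterruptedException e) { return; }
                    } else {
                        sink.accept(new ArrayList<>(batch)); // e.g. flush Puts to HBase
                    }
                }
            });
            t.start();
            workers.add(t);
        }
    }

    /** Called by the writing threads; blocks if the chosen queue is full. */
    void write(T record) throws InterruptedException {
        int q = (record.hashCode() & Integer.MAX_VALUE) % queues.size();
        queues.get(q).put(record);
    }

    /** Stops accepting work and waits for the streamers to drain their queues. */
    void shutdown() throws InterruptedException {
        running = false;
        for (Thread t : workers) t.join();
    }
}
```

The bounded queues provide the "elasticity" discussed below: a burst of writes blocks the producers rather than overwhelming the storage layer, while the streamer threads keep flushing at their own pace.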

Next, write a wrapper class similar to the following to isolate HBase specifics from the streaming application.

To test it with a distributed streaming application using Storm, write a bolt similar to the following skeleton. All that is needed is to initialize HBaseStreamers from within the bolt's prepare() method and dump data to HBase from within the bolt's execute().

Finally, write a Storm spout to serve as the streaming data source and a Storm topology builder to put the spout and bolt together.

The parallelism/queue parameters are set to relatively small numbers in the above sample code. Once tested and working, one can tweak the various dials in accordance with the server cluster's capacity. These dials include the following:

For simplicity, only HBase Put is handled in the above implementation. It can certainly be expanded to also handle HBase Increment so as to carry out aggregation functions such as count. The primary goal of this Storm-to-HBase streaming exercise is to showcase the use of a module equipped with some "elasticity" by means of configurable queues. The queueing mechanism within HBaseStreamers provides cushioned funnels for the data streams and helps optimize the overall data intake bandwidth. Keep in mind, though, that this doesn't remove the need for administration work on a properly configured HBase-Hadoop system.


The War For Software Talent

Having built software engineering teams for startups throughout my recent career, I must say that it's not the easiest thing to do. It has gotten even tougher in the past few years, as competition for software veterans has become fiercer than ever.

A mundane supply-and-demand issue

Prominent technology companies such as Google, Twitter, and Amazon have evolved over the past decade to become powerful titans by all standards. They have been vacuuming the talent space like black holes. Their continually sky-high stock values have led to spin-off startups founded by departed alumni, who are oftentimes some of the best talents themselves. These spun-off entrepreneurs were once key contributors but are no longer available, thus depleting the talent pool for hire.

The spin-offs leave the mother companies no choice but to step up their vacuuming power. Meanwhile, the spin-offs create their own vacuum machines and, with some elite company names in their profiles, instantly become additional black holes scooping up talent. That's all good for helping the technology industry prosper, but it inconveniently starves the average startup of desperately needed veterans, and running the code below is not going to help.

Because of the imbalance in the supply and demand of top talent, compensation packages have skyrocketed to the point that the average startup has a hard time justifying them. Unlike established and well-funded companies, these startups must watch where every single dime goes, and a veteran engineer's salary in the range of a small startup CEO's can be a deal breaker.

Product Managers, QA, DevOps, …

For any given software engineering team, focus is often put on the engineers responsible for the actual coding. But any company that has ever built real-world products realizes how tough it is to deliver them without an able product manager. Star product managers with adequate technical background and product management skills are equally hard to find. In all my previous startup ventures, product managers have always been among the hardest roles to fill with the right people.

Good QA engineers are more difficult to find than R&D engineers. Many engineers simply lack the attributes, including patience and a detective's mind, required to perform well in quality assurance. In addition, most people with strong programming skills prefer R&D jobs to QA jobs. The end result is that many QA engineers come from a background of less stringent training in programming. In a technologically demanding engineering team, that can be a barrier to carrying out quality QA tasks.

Experienced operations staff who handle systems and database administration have always been hard to find. The fact that I've been forced to play those roles myself, on and off, since my first startup venture in the '90s says a lot. As more and more cloud-based services arose and blurred the line between the software development and network operations worlds, DevOps engineers were born. Instead of supporting non-technical users, they support highly technical software engineers. So now we're talking about a hard-to-find sysadmin with a software engineering background. These DevOps engineers are like the network engineers of the old days: an endangered species.

What about quants?

The recent Big Data analytics movement has spiked a sudden demand for data scientists, a.k.a. quants. Quants possess domain expertise in quantitative and statistical analysis and machine learning algorithms that is crucial to businesses needing to digest their ever-growing Big Data. Oftentimes an advanced degree in a natural science discipline such as physics or mathematics is required to qualify for such jobs, although other disciplines like mechanical engineering and operations research also prove highly applicable.

Despite the demanding requirements, I had less trouble finding qualified quants than veteran software engineers. Perhaps demand for quants is still relatively fresh and not many companies know how best to tackle it yet. Coming from a natural science academic background myself, it was also a bit troubling to find that a quant with a PhD in physics from MIT and five years of post-doctoral work costs less than a software engineer with a BS in computer science from an average college and five years of programming experience.

Where can I find them?

To build or grow your engineering team, if you do not already have at least a couple of trusted lieutenants and engineers as the team's backbone, you will surely be up against a pretty big challenge. Unfortunately, that's not an uncommon situation. As you've advanced your career over the years, those who were once your star team members might have grown into roles similar to yours or started their own ventures (and now compete with you for talent). So chances are there are some critical roles you need to fill from time to time.

Conventional wisdom suggests that hiring through internal referrals is always preferred. That remains as true as ever. It also makes sense to reach further, connecting with your friends, alumni, ex-colleagues, advisory board, and board of directors for more leads. While success is hard to guarantee, posting jobs on job boards and professional social networks such as Monster and LinkedIn remains a logical step to advertise your hiring needs. One can also try local community networks like Craigslist, especially if you only want local candidates.

For projects with well-defined specs and clear metrics for measuring success, using less expensive offshore resources may make sense. The cost of a near-shore full-time engineer is between one-third and one-half that of a local engineer, but one should factor in the extra management cost incurred. Near-shore work has the advantage of more synchronous time zones. Trusted references about a provider's service quality are essential in your evaluation process.

As for quants, there might not be many candidates available for hire through the above channels. Because academic specialty weighs heavily in a quant job's requirements, it makes sense to try to acquire those talents directly from academia. NACELink (http://www.nacelink.com/) is a great starting point for advertising your need through their extensive school network.

Recruiters?

If you foresee an ongoing hiring need, you should get help from technical recruiters. They work on a contingency or retained basis; the latter is often preferred when you have a relatively large number of job openings to fill in the short term. It's recommended that a goal with a timeline be set upfront to measure success, so as to keep precious time and budget under control. On the other hand, there is no reason to limit your hiring channel to only one type of recruiter.

Finding competent technical recruiters is tricky, as you won't know whether a recruiter is good until you've gone through at least a couple of leads from them. For each job opening, it's recommended that you always provide your recruiters with a carefully thought-out set of technical questions to use as an initial filter, for a couple of reasons:

1. Nobody knows better than yourself what exactly the expertise you want from the candidate
1. 没有人比你自己更清楚你到底想从候选人那里获得什么专业知识

2. Many technical recruiters don’t necessarily have strong technical background, despite their technical title
2. 许多技术招聘人员尽管拥有技术职称,但不一定具有强大的技术背景

Conclusion: No magic pills

Finding top talent to join your engineering team can be an exhausting effort, fruitless at times. You have to invest a huge amount of your own time and effort, even if you use recruiters. The fact is that there are great techies who aren't good at marketing themselves, and mediocre ones with excellent profiles on paper. People talk about the 80-20 rule, but you'll be considered very lucky if 20% of your team is truly top talent.

A significant part of the talent acquisition process involves some selling effort. While selling the company's prospects and evangelizing the adopted technologies help, many engineering veterans nowadays are sophisticated enough to have done their homework. So the key is not to oversell and to stay consistent among the "sellers". It's analogous to reporting to your investors in a board meeting: just highlight the key achievements that count and back them with unambiguous metrics. They already did their homework. Even if not, you should assume they did.


Programming Exercise – Binary Tree

Like sorting algorithms, binary tree implementation is another good programming exercise. In particular, methods for traversing a tree, searching nodes in a binary search tree and many other binary tree operations form a great recipe for refreshing one’s programming skills. Appended is an implementation of a pretty comprehensive set of binary tree (and binary search tree) operations in Java.

Iterative and recursive methods for each of the operations are developed and grouped into two separate classes, BinTreeI and BinTreeR. In general, most operations are easier to implement recursively. For instance, calculating tree height iteratively is a rather non-trivial exercise, whereas it's almost a one-liner using recursion. Some of the operations, such as searching for and inserting a tree node, are only applicable to a binary search tree (BST), for which in-order traversal should be used. For generality, pre-order and post-order traversal methods are also included in the code.

Similar to the implementation of sorting algorithms in a previous blog, Java Generics and the Comparable interface are used. If desired, the underlying tree node could be further expanded to hold more complex node data such as map entries (e.g. with class type <Key extends Comparable<Key>, Value> and Map.Entry<K,V> data).
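To illustrate the point about recursion versus iteration, here is a minimal, self-contained sketch (separate from the appended source files; the names Node, heightR and heightI are illustrative) contrasting the recursive one-liner for tree height with the level-order iterative version:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class TreeHeightSketch {

    // A generic node along the lines described above.
    static class Node<T extends Comparable<T>> {
        T data;
        Node<T> left, right;
        Node(T data) { this.data = data; }
    }

    // Recursive: the height falls out of the definition almost for free.
    static <T extends Comparable<T>> int heightR(Node<T> node) {
        return node == null ? 0 : 1 + Math.max(heightR(node.left), heightR(node.right));
    }

    // Iterative: a level-order traversal counting levels -- noticeably more work.
    static <T extends Comparable<T>> int heightI(Node<T> root) {
        if (root == null) return 0;
        Queue<Node<T>> queue = new ArrayDeque<>();
        queue.add(root);
        int height = 0;
        while (!queue.isEmpty()) {
            int levelSize = queue.size();       // nodes on the current level
            for (int i = 0; i < levelSize; i++) {
                Node<T> n = queue.remove();
                if (n.left != null) queue.add(n.left);
                if (n.right != null) queue.add(n.right);
            }
            height++;                           // one whole level consumed
        }
        return height;
    }

    // A small BST: 4 at the root, 2 and 6 below, 1 as a leaf -> height 3.
    static Node<Integer> sample() {
        Node<Integer> root = new Node<>(4);
        root.left = new Node<>(2);
        root.right = new Node<>(6);
        root.left.left = new Node<>(1);
        return root;
    }
}
```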

Node.java – binary tree node

BinTree.java – base binary tree class

BinTreeR.java – binary tree class using recursive methods

BinTreeI.java – binary tree class using iterative methods

BinTreeMain.java – test application


NIO-based Reactor

Over the past few years, event-based architecture with non-blocking operations has become the norm for high-concurrency server design. The thread-per-connection (process-based) architecture is no longer favored as an efficient design, especially for handling high volumes of concurrent connections. The increasing popularity of Nginx and the relative decline of Apache httpd these days demonstrate the trend.

Java New I/O

Java's NIO (New I/O, a.k.a. Non-blocking I/O) provides a set of APIs to efficiently handle I/O operations. The key ingredients of NIO are Buffer, Channel and Selector. An NIO Buffer virtually provides direct access to the operating system's physical memory, along with a rich set of methods for alignment and paging of the selected memory that stores any primitive-type data of interest. An NIO Channel then serves as the conduit for bulk data transfers between the Buffer and the associated entity (e.g. a socket).

A socket channel can be configured in non-blocking mode, so that events such as reading data from the associated socket no longer block the invoking thread for more time than necessary. Together with the NIO Selector, which is responsible for picking out the concurrent events that are ready to be processed, the NIO APIs are well equipped to handle event-based operations in an efficient fashion.
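To make the wiring concrete, here is a minimal sketch (the class and method names are illustrative, not from the appended server code) of putting a server socket channel into non-blocking mode and registering it with a Selector for accept events:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class NonBlockingSetup {

    // Opens a server channel on an ephemeral port, switches it to
    // non-blocking mode and registers it for OP_ACCEPT events.
    static Selector setup() throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.socket().bind(new InetSocketAddress(0)); // port 0 = ephemeral port
        server.configureBlocking(false);                // mandatory before register()
        server.register(selector, SelectionKey.OP_ACCEPT);
        // selectNow() never blocks: with no pending connections it simply returns 0,
        // which is exactly the non-blocking behavior described above.
        int ready = selector.selectNow();
        return selector;
    }
}
```

A real dispatcher would loop on select() and iterate over selectedKeys(), but the registration above is the part that makes every subsequent operation non-blocking.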

Non-blocking vs Asynchronous

Note that non-blocking mode is different from asynchronous mode. In non-blocking mode, a requested operation returns immediately whether or not it was able to complete, thus freeing the invoking thread from being blocked. In asynchronous mode, a separate thread carries out the requested operation in parallel with the invoking thread. Java 7 enhanced NIO with support for asynchronous file and socket channels.
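As a small illustration of the asynchronous mode added in Java 7 (illustrative names, not part of the appended server), the sketch below reads a file through an AsynchronousFileChannel: the read call hands back a Future immediately and the work proceeds on a background thread until the caller actually needs the result:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Future;

public class AsyncReadSketch {

    // Reads an entire file asynchronously and decodes it as UTF-8.
    static String readAll(Path file) throws Exception {
        try (AsynchronousFileChannel ch =
                 AsynchronousFileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
            Future<Integer> pending = ch.read(buf, 0); // returns immediately
            pending.get();                             // block only when the result is needed
            buf.flip();
            return StandardCharsets.UTF_8.decode(buf).toString();
        }
    }

    // Demo helper: writes "hello" to a temp file and reads it back asynchronously.
    static String demo() throws Exception {
        Path tmp = Files.createTempFile("async-sketch", ".txt");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
        try {
            return readAll(tmp);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```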

Reactor pattern

The Reactor pattern is a popular event-based architecture. Using NIO, implementing a basic event-based server on top of the Reactor pattern is pretty straightforward. Appended is a bare-minimum Reactor-pattern server consisting of a Reactor class and a Handler class.

The single-threaded Reactor class houses the main dispatcher loop responsible for selecting registered events that are ready for socket read/write operations. The Acceptor, also single-threaded and registered with the dispatcher during initialization, is responsible for accepting socket connections requested by clients. Finally, the Handler class takes care of the actual events (read from socket, process data, write to socket) in accordance with its operational state.

Each Handler is associated with a SocketChannel and the Selector maintained by the Reactor class. Both variables are declared final, for performance as well as to allow access from the inner Runnable class. The handler registers with the dispatcher indicating the operation it's interested in (read or write) and gets dispatched when the associated socket is ready for that operation. The Runnable class forms the worker thread pool and is responsible for data processing (in this simple case, echoing), leaving the Handler thread responsible for just the socket reads and writes.

To test the server, just launch it on a host (e.g. server1.example.com) and run a few telnet instances connecting to the server port (e.g. telnet server1.example.com 9090).

Source code: Reactor.java

Source code: Handler.java


Tech Resumes Today and Tomorrow

Almost anyone who works for a living has a resume and maintains it with some effort. Due to the private nature and changeability of resume content, most people author their own resumes. Whether a resume is diligently adorned from time to time or reluctantly updated only to follow the social norm, it tells a lot about its author, often more than the author intends.

While it's fine to borrow layout or even writing style from others, the content of your resume had better be your own work. Your resume represents so much of your accumulated skills and professional accomplishments over some of your most productive years that it's justifiably worth some real effort of your own to assemble it. More importantly, you're going to have to back the words written on your resume during job interviews, and it's easier to back your own words than someone else's. Before you finalize the "production version" of your resume, though, it's a good idea to solicit feedback from trusted friends and refine it as you see fit.

Resume screening and deep-dive

In the software talent acquisition process, resume review is one of the most boring but important tasks. Getting the most out of a bunch of word streams in various writing styles certainly demands effort, and the ongoing fierce competition for software talent, especially in Silicon Valley, helps ensure it's a lengthy, if not endless, one.

Generally there is a screening process in which the recruiter filters a large quantity of resumes to weed out the obvious mismatches, and another process involving a deep dive into the stream of words to try to compose, however superficial it might be, a virtual representation of the resume owner. Resume screening happens upfront, and the deep dive typically happens after the initial scans and phone or early-round interviews.

During resume screening, usually only technical keywords relevant to the job opening and a rough sense of current career state and professional seniority are extracted in a cursory fashion. The Professional Summary section weighs a lot in this process, as each resume gets only a sub-minute glance due to volume. The more in-depth resume evaluation involves digging into the individual jobs listed in the resume. Besides the information intentionally conveyed by the author to the audience, the reviewer might also try to read between the lines to deduce what expertise the candidate actually possesses. Whether the evaluation process is implicit or well defined, a good chunk of the following will be compiled in some way:

  • Current and most recent job positions/ranks/responsibilities
  • Has the candidate been with highly regarded companies in recent years?
  • Total years of professional experience at the sought level
  • Is the candidate still playing a hands-on role?
  • Candidate’s academic major and highest degree attained
  • Did the candidate graduate from a reputable or preferably Ivy League school?
  • Does the candidate have a progressive career history?
  • Average duration of individual jobs in the past

A capable technical recruiter can carry out quality resume screening and perhaps part of the in-depth evaluation for the hiring manager. But even with a well-prepared reader's digest from the recruiter, the hiring manager ultimately has to dedicate his or her own bandwidth to at least read through the resume, which is supposedly the source of data coming directly from the candidate.

Polished vs Crude

Software engineers generally are not the best marketers. I've seen many resumes littered with boldfaced keywords throughout, resulting in a blob of randomly cluttered text. Sadly, oftentimes the cluttering decoration is actually the work of head-hunters trying to impress hiring managers with job-matching keywords. Some resumes are downright fraudulent. The worst ones show clear evidence of poorly automated fabrication of skillsets clause-matched to the specific job post.

A resume that modestly reveals exceptional technical expertise in a simple, concise writing style often gets the highest respect. Hard-core software veterans tend to project an image of a raw, no-nonsense personality, often along with a dose of attitude. Many would prefer to keep their resumes less well packaged even if they're capable of doing otherwise. Most of the time that dose of attitude is just a reflection of high confidence. Sometimes, however, an excessively righteous tone can be an indication of a narcissistic non-team player. Whether that dose of attitude is healthy or excessive, one will surely find out during the in-person interviews.

The hiring ecosystem

I think the entire hiring ecosystem today is very inefficient. You have job seekers wanting to trade their skills for the best available compensation package, and employers offering market-rate compensation in exchange for those skills. Both parties claim to be among the best, but neither trusts the other. Recruiters and head-hunters aren't unbiased middlemen, because they work one-sidedly for the employers who pay them, and filling the job openings ASAP, rather than finding the best match, is their only priority. Job boards also operate in favor of the employers who fund their revenue. The same goes for professional social networking sites such as LinkedIn, whose main revenue comes from selling analytics data to companies.

Such one-sidedness is not necessarily a problem. In trading, you also have many brokers playing a one-sided middleman role. But typical traded products have well-defined specifications and/or pricing standards within their product space. In hiring, you're trading intangible skills. There are no common specifications or standards for skills that both employers and job seekers can use as references.

Theoretically, trading your skills for compensation should be a fair game, but in reality, unless you possess certain skills in high demand at the time, employers usually have the upper hand, perhaps because a majority of workers are perceived as replaceable commodities. And evidently, even the high-demand skills change from time to time. Unfortunately, I don't see how this one-sidedness will change in the foreseeable future.

The future of tech resumes

Today, composing a resume is largely, if not wholly, a marketing exercise. Had there been a set of common specifications for skills, assembling a resume would be more like an accounting exercise in which skills and experience are logged in accordance with some standard weighting matrix. Resumes would then be a much more objective source of qualification data. Without some sort of skill-measuring standard, employers will continue to come up with their wishful job requirements, and job seekers will keep assembling resumes in their own writing styles with subjectively rated skill levels. As a result, the skill match between a given job post and a resume is almost always superficial or accidental.

What a practical rating method for skills would look like is the million-dollar question here. Peering into the not-too-far future, I suspect there is going to be some standards-based semantic foundation on top of which job history and academic achievement can be systematically rated. In addition, perhaps some credential scoring system similar to StackOverflow.com's model could also be used in the rating methodology.

All that would require an underlying layer of some sort of standard software engineering ontology (e.g. "ISO/IEC/IEEE 24765: Systems and Software Engineering Vocabulary") so that all the job functions and skillsets logged in a resume have referential meaning. The raw content of a resume would be composed in a format suitable for machine interpretation (e.g. the Resource Description Framework, a.k.a. RDF, of the Semantic Web). As to presentation-layer tools, some readily available reader or browser would allow a human to interactively query the latest information in any area of interest within a resume at various levels of granularity, and perform ad-hoc analysis and qualification comparison among competing candidates. Job posts would also be structured in accordance with the same underlying semantics, making the matching of job seekers with employers more science than art.


Big Data SaaS For HAN Devices

At one of the startups I co-founded in recent years, I was responsible for building a SaaS (Software-as-a-Service) platform to manage a rapidly growing set of HAN (Home Area Network) devices across multiple geographies. It's an interesting product belonging to the world of IoT (Internet of Things), a buzzword that wasn't popular at all back in 2007. Building such a product required a lot of adventurous exploration and R&D effort from me and my team, especially back when SaaS and HAN were generally perceived as two completely segregated worlds. The company is EcoFactor, in the energy/cleantech space.

Our goal was to optimize residential home energy use, particularly in the largely overlooked HVAC (heating, ventilation, and air conditioning) area. We were after the consumer market and chose to leverage channel partners in various verticals, including HVAC service companies, broadband service providers and energy retailers, to reach mass customers. The main focus was twofold: energy efficiency and energy load shaping. Energy efficiency is all about saving energy while not significantly compromising comfort, and energy load shaping primarily targets utility companies, who have a vast interest in reducing spikes in energy load during usage peak-time.

Home energy efficiency

Implementing energy efficiency requires intelligence derived from mass real-world data and delivered by optimization algorithms. Proper execution of such optimization isn't trivial. It involves deep learning of the HVAC usage pattern in a given home, analysis of the building envelope (i.e. how well-insulated the building is), the users' thermostat control activities, etc. All that information is deduced from the raw thermal data spit out by the thermostats, without needing to ask the users a single question. Execution takes the form of programmatic refinement through learning over time, as well as interactive adjustment in accordance with feedback from ad-hoc activities.

Obviously, local weather conditions and forecast information are another crucial input for executing the energy efficiency strategy. Besides temperature, other parameters such as solar/radiation conditions and humidity are also important. There are quite a lot of commercial weather datafeed services available for subscription, though one can also acquire raw U.S. data directly from NCDC (National Climatic Data Center).

Energy load shaping

Many utilities offer demand response programs, often with incentives, to curtail energy consumption during usage peak-time (e.g. late afternoon on a hot Summer day). Load reduction in a conventional demand response program inevitably causes discomfort for the home occupants, leading to a high opt-out rate that defeats the very purpose of the program. Since the "thermal signature" of individual homes was readily available from the vast thermal data being collected around the clock, it didn't take too much effort to come up with a suitable load shaping strategy, including pre-conditioning, for each home to take care of the comfort factor while conducting a demand response program. Utility companies loved the result.

HAN devices

The product functionality described so far seems to suggest that: a) some complicated device communications protocol that didn't yet exist would be needed, and b) in-house hardware/firmware engineering effort would be required. Fortunately, there were already some WPAN (Wireless Personal Area Network) protocols, such as ZigBee/IEEE 802.15.4, Z-Wave and 6LoWPAN (and other wireless protocols such as WiFi/IEEE 802.11.x), although implementations were still experimental at the time I started researching that space.

We wanted to stay in the software space (more specifically, SaaS) and focus on delivering business intelligence out of the collected data, hence we would do everything we could to keep our product hardware- and protocol-agnostic. Instead of delving into the hardware engineering world ourselves, we sought and adopted strategic partnerships with suitable hardware vendors and worked collaboratively with them to build quality hardware matching our functionality requirements.

Back in 2007, the WPAN-based devices available on the market were too immature to be used even for field trials, so we started out with some IP-based thermostats, each equipped with a stripped-down HTTP server. Along with the manufacturer's REST-based device access service, we had our first-generation two-way communicating thermostats for proof-of-concept work. Field trials were conducted simultaneously in both Texas and Australia so as to capture Summer and Winter data at the same time. The trials were a success. In particular, the trial results answered the few key hypotheses that were the backbone of our value proposition.

WPAN vs WiFi

To prepare ourselves for large-scale deployment, a low-cost barebones programmable thermostat of the kind found in a local hardware store like Home Depot is what we were going after as the base hardware. The remaining requirement would be to equip it with a low-cost chip that could communicate over some industry-standard protocol. An IP-based thermostat requiring ethernet cable to be run inside a house was out of the question for both deployment cost and cosmetic reasons, as we learned a great deal from our field trials. In essence, we only considered thermostats communicating over wireless protocols such as WPAN or WiFi.

Next, WPAN won out over WiFi because it required relatively less messing with the broadband network in individual homes, and its low-power specs work better for battery-powered thermostats. Finally, ZigBee became our choice for the first mass deployment because of its relatively robust application profiles tailored for energy efficiency and home automation. Another reason is that it was going to be the protocol SmartMeters would use, and communicating with SmartMeters for energy consumption information was on our product roadmap.

ZigBee forms a low-power wireless mesh network in which nodes relay communications. At 250 kbit/s it isn't a high-speed protocol, and it can operate in the 2.4GHz frequency band. It's built on top of IEEE 802.15.4 and is equipped with industry-standard public-key cryptography security. Within a ZigBee network, a ZigBee gateway device typically serves as the network coordinator, responsible for enforcing the network's security policy and the enrollment of joining devices. It connects via ethernet cable or WiFi to a broadband router on one end and communicates wirelessly with the ZigBee devices in the home. The gateway device, in essence, is the conduit to the associated HAN devices. Broadband internet connectivity is how these HAN devices communicate with our SaaS platform in the cloud, which means we only target homes with broadband internet service.

The SaaS platform

Our very first SaaS prototype system, built prior to VC funding, ran on a LAMP platform using first-generation algorithms co-developed by a small group of physicists from academia. We later rebuilt the production version on the Java platform using a suite of selected open-source application servers and frameworks, supplemented with algorithms written in Python, along with tools for development, build automation, source control, integration and QA. Heavy R&D on optimization strategy and machine learning algorithms was performed by a dedicated taskforce and integrated into the "brain" of the SaaS platform.

Relational databases were set up initially to persist the data acquired from the HAN devices in homes across the nation (and beyond). The data acquisition architecture was later revamped to use HBase as a fast data-dumping persistent store to accommodate the rapidly growing around-the-clock data stream. Only selected data sets were funneled to the relational databases for application logic requiring more complex CRUD (create, read, update and delete) operations. Demanding Big Data mining, aggregation and analytics tasks were performed on Hadoop/HDFS clusters.

Under the software-focused principle, our SaaS applications do not directly handle low-level communications with the gateway and thermostat devices. The selected gateway vendor provides its own PaaS (Platform-as-a-Service), which takes care of M2M (machine-to-machine) hardware communications and exposes a set of APIs for basic device operations. The platform also maintains bidirectional communications with the gateway devices by means of periodic phone-home from the devices and UDP source-port keep-alive (a.k.a. hole-punching, for inbound communications through the firewall in a typical broadband router). Such separation of work allows us to focus on the high-level application logic and business intelligence. It also allows us to more easily extend our service to multiple hardware vendors.

Algorithms

Obviously I can't get into any specifics of the algorithms, which represent collective intellectual work developed and scrutinized by domain experts since the very beginning of the startup venture. Suffice it to say that they constitute the brain of the SaaS application. Besides information garnered from historical data, the execution also takes into account interactive feedback from the users (e.g. ad-hoc manual overrides of temperature settings on the thermostat via the up/down buttons or a mobile app for thermostat control) and modifies the existing optimization strategy accordingly.

Lots of modeling and in-depth learning of real-world data were performed in the areas of thermal energy exchange in a building, HVAC run-time, thermostat temperature cycles, etc. A team of quants with strong backgrounds in Physics and numerical analysis was assembled to focus on just the relevant work. Besides custom optimization algorithms, machine learning algorithms including clustering analysis (e.g. k-Means Clustering) were employed for various kinds of tasks such as fault detection.
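To illustrate the clustering approach, here is a minimal sketch of how k-Means might flag anomalous HVAC behavior with the SciPy/NumPy stack mentioned below. The features, simulated data and thresholds are all hypothetical, not the actual fault-detection algorithms:

```python
import numpy as np
from scipy.cluster.vq import kmeans2, whiten

# Hypothetical daily features per thermostat: [HVAC run-time hours, on/off cycle count].
# A short-cycling unit (a common fault) runs long and cycles far more than its peers.
rng = np.random.default_rng(7)
normal = rng.normal(loc=[6.0, 20.0], scale=[1.0, 3.0], size=(200, 2))
faulty = rng.normal(loc=[18.0, 90.0], scale=[1.0, 5.0], size=(5, 2))
samples = np.vstack([normal, faulty])

features = whiten(samples)  # scale each feature to unit variance
centroids, labels = kmeans2(features, 2, minit='++', seed=7)

# The sparsely populated cluster is the candidate set of faulty devices.
fault_cluster = int(np.argmin(np.bincount(labels)))
suspects = np.flatnonzero(labels == fault_cluster)
print(f"{suspects.size} of {samples.shape[0]} devices flagged for inspection")
```

In practice the flagged cluster would only seed a watch list for further diagnostics, not an automatic fault verdict.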

A good portion of the algorithmic programming work was done on the Python platform, primarily for its abundance of contemporary Math libraries (SciPy, NumPy, etc). Other useful tools include R for programmatic statistical analysis and Matlab/Octave for modeling. For good reasons, the quant team is the group demanding the most computing resources from the Hadoop platform. And Hadoop’s streaming API makes it possible to maintain a hybrid of Java and Python applications. A Hadoop/HDFS cluster was used to accommodate all the massive data aggregation operations. On the other hand, a relational database with its schema optimized for the quant programs was built to handle real-time data processing, while a long-term solution using HBase was being developed.
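Hadoop Streaming runs any executable that reads lines on stdin and writes tab-separated key/value pairs on stdout, which is what makes the Java/Python hybrid possible. Here is a hedged sketch of such a Python job; the CSV layout and field names are made up for illustration. The mapper emits (thermostat_id, run-time) pairs and the reducer sums them per device:

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming job: sum HVAC run-time minutes per thermostat.
# Input lines are assumed to be "thermostat_id,timestamp,runtime_minutes" CSV.
import sys

def mapper(lines):
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) < 3:
            continue  # skip malformed records rather than failing the task
        yield f"{fields[0]}\t{fields[2]}"

def reducer(lines):
    # Hadoop delivers mapper output to the reducer sorted by key.
    current, total = None, 0.0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0.0
        total += float(value)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__" and not sys.stdin.isatty():
    step = reducer if sys.argv[-1] == "reduce" else mapper
    for out in step(sys.stdin):
        print(out)
```

The same script would then be shipped to the cluster with something along the lines of `hadoop jar hadoop-streaming.jar -mapper 'job.py map' -reducer 'job.py reduce' -input … -output …` (jar name and paths depend on the installation).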

Putting everything together

Although elastic cloud services such as Amazon’s EC2 have been hot and great for marketing, our around-the-clock data acquisition model involves a predictable volume and a steady stream rate. So the cloud’s elasticity wouldn’t benefit us much, though it is useful for development work and benchmarking.

Another factor is security, which is one of the most critical requirements in operating an energy management business. A malicious attack that simultaneously switches on 100,000 A/Cs in a metropolitan region on a hot summer day could easily bring down the entire grid. Cloud computing services tend to come with less flexible security measures, whereas one can more easily implement enhanced security in a conventional hosting environment, and co-located hosting offers the highest flexibility in that regard. Thus the decision was made.

That pretty much covers all the key ingredients of this interesting product that brings together the disparate SaaS and HAN worlds at a Big Data scale. All in all, it was a rather ambitious endeavor on both the business and technology fronts, certainly not without challenges – which I’ll talk about in a separate post some other time.


Thoughts On Technology Marketing

In the commercial sector, many technology companies succeed in their space with average technology wrapped in slick business strategy. Only a handful rise to the top by showcasing technology superior to their competitors’. It’s safe to assert that a viable business model coupled with the right execution remains the key driving success factor.

Superior technology

Nevertheless, superiority in technology helps get a company ahead of its competitors. The lead may not persist forever, but while it lasts, the reputation of being the best helps seize significant market share. Every once in a while, we see products and services backed by superior technology win over mass customers and crush competitors. Examples are Oracle’s database, Sun’s server hardware, Google’s search algorithm and Apple’s iPhone hardware/UI.

On the other hand, no matter how great your product is, competitors tend to catch up with comparable technologies quickly. Superiority in technology alone isn’t sufficient for success, but it does gain respect from the technology community, and a direct consequence is that the company is likely to attract top technology talent. Conversely, being widely perceived as technologically inferior to its competitors would sooner or later cause a company to lose.

There are lots of brilliant technologists, so the emergence of great technological work isn’t rare. What’s rare is the right execution with the right timing. From time to time, we see great ideas implemented at the “wrong time”, like:

  • Resource-intensive GUI on computers when CPU speed was at 5 MHz
  • Cloud-based systems with common Internet connection at 56 kbit/s

Why technology marketing?

We do not have much control over the right timing, which is often prone to subjective interpretation plus a bit of luck. The Semantic Web and the Internet of Things are two examples that forward-thinking technologists started advocating more than a decade ago, yet they are nowhere near being widely adopted in their supposedly ubiquitous form. We do, however, have control over how to capitalize on internally developed technologies beyond building product. One approach is to publish or open-source selected technologies. Intentional or not, this is a marketing effort. Below are a few bullet points highlighting some of the benefits:

  • Project an image of being a technology pioneer
  • Good Samaritan, giving back to the technology community
  • Benefit from the general collaborative open-source development effort
  • Attract top technologists
  • Get feedback for improvement from a broad technology community

The bottom line is that almost everything, especially in the commercial sector, needs some marketing effort to shine. Technology is no different. More importantly, marketing your product directly is inevitably met with natural skepticism since you’re expected to talk up your own product, and the effect is short-lived as every business, including your competitors, is doing the same thing. Marketing the underlying technology of your product adds subtlety to the conventional product marketing effect that customers have long been numb to.

Publicizing technologies

Many technology companies have already been doing that:

  • Google published their Big Data work such as MapReduce and BigTable, and released the interface definition language Protocol Buffers, among many other things.
  • As another company that deals with data at a real Big Data level, Facebook gave out the NoSQL database Cassandra, the interface definition language Thrift, and Scribe for streamed data aggregation.
  • Yahoo still gets a lot of respect from the technology community, not because of their search engine, email service or media-popular CEO, but for their relevance in the Big Data technology space, particularly Hadoop.
  • Twitter incubated the distributed real-time streaming software, Storm.
  • LinkedIn created the high-performance distributed messaging software, Kafka.
  • Netflix rolled out Curator, a Java client library for ZooKeeper, plus a bunch of cloud-centric software.
  • Meanwhile, I’m not aware of any open-source contribution from Amazon, but the popularity of their EC2 platform made them a cloud service pioneer. The retail giant was hardly perceived as a leading tech company before they expanded into cloud services.

When not to publicize your technologies?

Publicizing your internally developed technologies isn’t necessarily a good move in all cases. It might not be a good idea to expose a technology to the public, especially in the form of open-source, if the technology:

  • constitutes your core business intellectual property (i.e. secret sauce)
  • isn’t compliant with industry standards
  • doesn’t work well with contemporary open-source platforms
  • is just “yet another” ordinary implementation of a certain technology
  • hasn’t been and won’t be used in some of your own products
  • isn’t polished enough to give out to external technologists

Like any marketing effort, technology marketing takes significant resources. That’s why the companies that can afford it are in general well-established, with abundant engineering resources. However, even for smaller companies and startups, if there is marketable and shareable technological work along with the right expertise in-house, publicizing it is still worth serious consideration.


Challenges Of Big Data + SaaS + HAN

This is part two of a previous post about building and operating a Big Data SaaS for Home Area Network devices during my 5-year tenure with EcoFactor. Simply put, our main goal was to add “smarts” to residential heating and cooling systems (i.e. heaters and air conditioners, a.k.a. HVAC) via ordinary thermostats. That focus led to a superficial perception by some people that we were a smart thermostat device company. In actuality, we have always been a software service, virtually agnostic to both hardware and communications protocol. It’s more of an IoT version of the “Intel Inside” business model.

Challenges from all fronts

Like building any startup company, there was a wide spectrum of challenges confronting us, which is what this post is going to talk about. The funding environment was pretty hellish, as we started just shortly before the 2007-2008 financial crisis. And the failure of some high-profile solar companies in subsequent years certainly didn’t help make the once-hyped cleantech a favorable sector for investors.

The ever-growing fierce competition for software engineering talent was, and has been, a big challenge for pretty much every startup in Silicon Valley. On the technology front, production-grade open-source Big Data technologies weren’t there yet, leading to the need for a lot of internal R&D effort by individual companies, which in turn required domain experts in both development and operations who were scarce, endangered species back then – completing the vicious infinite loop that starts with the hiring difficulty.

Operational processes

On the operational front, there was a long list of processes that needed to be carefully established and managed – from user acquisition, on-boarding, device installer training, and scheduling coordination for on-site device installation, to technical support for installers and customer service. Getting into the details of how all that was done would warrant writing a book. In charge of product and marketing, Scott Hublou, who is also a co-founder of the company, owned the “horrendous” list.

Many of the items in the list are correlated. For instance, getting HVAC technicians to create a HAN network and pair up thermostats with the HAN gateway during an on-site installation not only required a custom-built software tool with a well-thought-out workflow and an easy UI, but also thorough training and a knowledgeable support team to back them up for ad-hoc troubleshooting.

Back on the engineering side, a key piece of operations is the technology infrastructure, which needs to cope with future business growth. That includes systems hosting, network and data architecture, server clusters for distributed computing, load balancing systems, fail-over and monitoring mechanisms, firewalls, etc. As a startup company, we started with something simple but expandable to conserve cash, and scaled up as quickly as necessary. That’s also a practical approach from the design point of view to avoid over-engineering.

State of WPAN

On hardware, the applicable HAN communications protocols and HAN device hardware were far from ready for mass deployment at the time we started exploring that space. That’s a non-trivial challenge for anybody who wants to get into the space. On the other hand, if done right, it represents an opportunity to pioneer in a relatively new arena.

ZigBee, an IEEE 802.15.4-based standard WPAN (Wireless Personal Area Network) protocol, was our selected communications protocol for scaled deployment. While it’s a robust protocol compared with others such as Z-Wave, its specification was still undergoing changes and few real-world implementations had ever exploited its full features.

The protocol comes with a few predefined application profiles, including the Smart Energy and Home Automation profiles. Part of our core business is about translating HVAC operations data via thermostats into actionable business intelligence, hence the ability to acquire key attributes from these devices is crucial. We quickly discovered that some attributes as basic as HVAC state were missing in certain application profiles, and we had to not only utilize multiple profiles but also extend to using custom attributes in the ZCL (ZigBee Cluster Library).

Working with technology partners

Working with hardware technology partners does present some other challenges. HAN device firmware and embedded software development is a totally different beast from SaaS/server application development. Python on Linux is a prominent embedded software platform. While that’s also a popular combo for server software development, the two worlds bear little resemblance. Building a system that bridges the two takes learning and collaborative effort from both camps.

Some of our HAN device partners were quick to realize the significance of backing their gateway devices with a scalable PaaS infrastructure, and invested significant effort in M2M (Machine-to-Machine) through acquisition and internal development. But coming from a hardware background, our hardware partners inevitably faced a non-trivial learning curve to get it right in areas such as software service scalability. Leveraging our internal scalable-SaaS development experience and our partners’ embedded software engineering expertise, we managed to put together the best ingredients from both worlds into the cooperative work.

OTA firmware update

OTA (Over-the-Air) firmware update generally refers to wireless firmware update. Our devices run on a WPAN protocol and the firmware is OTA-able. It’s probably one of the operations that create the most anxiety, as an update failure may result in “bricking” devices in volume, leading to the worst possible user experience. A bricked thermostat that results in an inoperable HVAC (i.e. heater / air conditioner) would be the last thing the home occupant wants to deal with on a 105°F summer day – or worse, a potentially life-threatening hazard on a 10°F winter night.

This critical task is all about making sure the entire update procedure is foolproof from end to end. The important thing is to go through lots of rehearsals in advance. In addition, the capability to roll back the firmware version is as critical as the forward update, so that the update can be undone should unforeseen issues arise post-update. Startups typically work at such a cut-throat pace that it’s tempting to circumvent pre-production tests whenever possible. But this is one of those operations where even a minor compromise of stringent tests could mean the end of the business.

Pull vs Push

The around-the-clock time-series data acquisition from a growing volume of primitive HAN devices is a capacity-intensive requirement. Understanding that it was going to be a temporary method for smaller-scale deployments, we started out using a simplistic pull model to mechanically acquire data from the HAN gateway devices. These devices gather data serially from their associated thermostat devices, making a single trip to a gateway-connected thermostat device cost a few seconds to tens of seconds. To come up with a data acquisition method that could scale, we needed something at least an order of magnitude faster.

With larger-scale deployments in the pipeline, we didn’t waste any time and worked collaboratively with all involved parties early on to build a scalable solution. We went back to the drawing board to scrutinize the various data communication methods supported by the WPAN specifications and laid out a few architectural changes. First, we switched the data acquisition model from pull to push. Such a change affected not only data communications within our internal SaaS applications but also the end-to-end data flow spanning our partners’ PaaS systems.

One of the key changes was to come up with standards-compliant methods that minimize necessary data retrievals via unexploited features such as attribute grouping and differential reporting under the push model. Attribute grouping allows selected attributes to be bundled into a single packet for delivery, instead of emitting individual attributes serially in multiple deliveries. Differential reporting helps minimize necessary data deliveries by triggering a data transfer only when at least one of the selected attributes has changed. All that meant lots of extra work for everybody in the short term, but in exchange for a scalable solution in the long run.
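In Python-flavored pseudocode, the push-side logic of those two features might look like the sketch below. The attribute names, deadband and report format are hypothetical stand-ins for illustration, not the actual ZCL reporting configuration:

```python
REPORTED_ATTRS = ("hvac_state", "setpoint", "indoor_temp")

def changed_enough(prev, curr, deadband):
    """Differential reporting: has this attribute changed enough to report?"""
    if prev is None:
        return True                      # never reported before
    if isinstance(curr, float):
        return abs(curr - prev) >= deadband
    return curr != prev

def make_report(device_id, readings, last_sent, deadband=0.5):
    """Return one grouped report packet, or None if nothing changed enough."""
    if not any(changed_enough(last_sent.get(a), readings[a], deadband)
               for a in REPORTED_ATTRS):
        return None                      # suppress the push entirely
    # Attribute grouping: bundle all selected attributes into a single packet.
    report = {a: readings[a] for a in REPORTED_ATTRS}
    last_sent.update(report)             # remember what was last pushed
    return {"device_id": device_id, **report}

# A gateway loop would call make_report() each reporting interval and push
# only the non-None results upstream.
```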

Collaborative work pays off

The challenges mentioned above wouldn’t have been resolvable had there not been a cross-functional team of technologists working diligently and creatively to make it happen. Performance was boosted by orders of magnitude after implementing the new data acquisition method. More importantly, the collective work in some way set a standard for large-scale data acquisition from SaaS-managed HAN devices. It was an invaluable experience being a part of the endeavor.


Real-time Big Data Revisited

My previous blog post about real-time Big Data centers on some relevant open-source software (e.g. Storm, Kafka). This post shifts the focus towards reviewing its current state.

One thing the computing technology industry has never been starved of is the successive rise and fall of buzzwords – B2B, P2P, SOA, AOP, M2M, SaaS/PaaS, IoT, RWD (responsive web design), SDN (software-defined networking), … Recently, Big Data has been one of the few that have taken center stage.

How big is Big Data?

What is Big Data anyway? Typical structured data is in table format, with columns and rows. For example, a dataset of 500,000 Web pages might be represented by 500,000 rows of data, each with 3 columns of text: URL, page title, page content. In general, people use the term Big Data for data with a large number of columns and/or rows. But how big is big?

The “yield point” at which a contemporary RDBMS (relational database management system) can no longer perform well on decent server hardware is often considered the starting point for a Big Data system. That’s obviously a vague, unscientific reference. In a recent startup operation, we maintained a pretty massive transactional RDBMS (with fail-over) on a couple of ordinary quad-core Xeon server boxes stuffed with a bunch of RAID 0+1 disks. There were a couple of optimally tuned transactional tables at 400+ million rows, with actively used queries outer-joining them, and the database performed just fine, showing no signs of yielding any time soon. On the other hand, I have also seen ordinary queries grind a database to a halt with transactional tables at just a few million rows.

Is Big Data for everyone?

Nevertheless, I’ve heard quite a few horror stories about companies delving into Big Data only to realize the extensive (read: expensive) R&D work was unjustified. Some grudgingly returned to the relational database model after pouring tons of resources into building a column-oriented distributed database system. It’s tempting to conclude that you need to immediately switch from an RDBMS to a column-oriented database when a projection shows that your dataset will grow to 1 petabyte in 3 years. That conclusion may be flawed if the actual business requirement analysis isn’t thorough. For instance, it could be that:

  • the dataset won’t reach anywhere near a small percentage of the petabyte scale for the first 2+ years
  • data older than 3 months is not required to be in raw format and can be aggregated down to fractions of the original data volume
  • the petabyte data size is due to certain huge data fields while the actual row count is under tens of millions, which can be managed with a properly administered RDBMS

There are a lot of tech discussions about the pros and cons of relational databases versus column-oriented databases, so I’m not going to repeat those arguments. It suffices to say that by switching from an RDBMS to a column-oriented database, you’re trading away a whole bunch of the good stuff relational databases offer, primarily for high data capacity, fast writes and built-in fault tolerance.

Adding real-time into the mix

Real-time is a term subject to contextual interpretation. In a looser sense, response times from milliseconds to a few seconds are often regarded as real-time. As data volume increases, even such a loose requirement is no easy matter.

Let’s say it’s objectively determined that a column-oriented database needs to be a part of your Big Data system; the next question is probably how “real-time” you need the system to be in servicing data requests. Trying to make every bit of data in a Big Data system available for real-time (or near-real-time) random access is a difficult proposition. A more practical approach is to maintain a data warehouse with a set of updatable pre-computed views over all persisted data, augmented by a real-time subsystem that provides access to the recently transacted data that hasn’t made it to the warehouse yet. The real-time subsystem is kept relatively lean by regularly discarding data that has been secured in the warehouse.

Lambda Architecture

The Lambda Architecture advocated by Nathan Marz (the creator of Storm) proposes a Big Data system composed of a batch and a real-time subsystem to cooperatively serve real-time queries across the entire persisted dataset. Based on a preview of the early-access edition of Marz’s book, my understanding of the architecture is that it consists of:

  • a Batch Layer that appends data to the immutable master dataset and continuously refreshes batch views (in the form of query functions) by recomputing arbitrary functions on the entire dataset
  • a Serving Layer that processes the batch views and provides query service
  • a Speed Layer that produces real-time views from newly acquired data and regularly rotates data off to the Batch Layer
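Stripped to its essence, the three layers cooperate like the toy sketch below, where in-memory dicts stand in for HDFS-backed batch views and an HBase-backed speed layer. This is my own illustrative reduction, not code from the book:

```python
from collections import defaultdict

class LambdaCounter:
    """Toy Lambda-style event counter: batch view + speed layer, merged at query time."""

    def __init__(self):
        self.master = []                    # immutable, append-only master dataset
        self.batch_view = defaultdict(int)  # precomputed by the batch layer
        self.speed_view = defaultdict(int)  # incremental view of recent data

    def ingest(self, key):
        self.master.append(key)             # the batch layer sees it on next recompute
        self.speed_view[key] += 1           # the speed layer serves it immediately

    def recompute_batch(self):
        # Batch layer: an arbitrary function recomputed over the *entire* master dataset.
        view = defaultdict(int)
        for key in self.master:
            view[key] += 1
        self.batch_view = view
        self.speed_view.clear()             # rotate absorbed data off the speed layer

    def query(self, key):
        # Serving layer: merge the batch view with the not-yet-absorbed real-time view.
        return self.batch_view[key] + self.speed_view[key]
```

The atomic "recompute then clear" step is a deliberate simplification; a real implementation overlaps batch recomputation with ongoing ingestion.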

Apparently, the architecture’s underlying design is oriented towards functional programming, which is in principle rooted in the Lambda Calculus. Under this computing paradigm, arbitrary data processing operations are expressed as compositions of functions that are program-state-independent and operate on the entire immutable dataset.

The architecture also showcases the principle of separation of concerns, with each layer handling the specific Big Data tasks it’s purposely designed for. The master dataset is maintained in the Batch Layer as append-only immutable raw data on a redundant distributed computing platform (e.g. Hadoop HDFS), allowing full data reprocessing in the event of major data processing errors. On the other hand, the Speed Layer is better served by a real-time messaging or streaming system (e.g. Storm) backed by a random read-write capable persistent store (e.g. HBase). It’s an architecture that is elegant in principle, and I look forward to seeing its final edition and real-world implementations.

Is real-time Big Data ripe for mainstream businesses?

Aside from distribution companies such as Cloudera and HortonWorks, there is a wide range of companies and startups building their entire business on providing Big Data services. Then there are the tech giants (e.g. EMC) which see Big Data as a significant part of their strategic direction. As to the need for real-time, there has been debate on whether the actual demand is imminent for businesses other than a handful of global real-time search/newsfeed services such as Twitter.

On one hand, a bunch of commercial products and open-source software frameworks have emerged to address that very need. On the other hand, businesses at large are still struggling to interpret the actual needs (i.e. how big and how real-time) of themselves and/or their customers. Here’s one data point – I recently had a discussion with a founder of a Big Data platform provider who expressed skepticism about the imminent demand for real-time Big Data, based on what he heard from his customers.

Today, short of a robust industry-standard framework, many businesses take custom approaches: dumping incoming data into a column-oriented database like HBase, performing filtering scans, and outputting selective data into a relational database for their real-time query needs. Until a readily customizable framework with a robust underlying architecture like the Lambda Architecture is available, these businesses will have to continue to exhaust engineering resources building their own real-time Big Data solutions.

2 thoughts on “Real-time Big Data Revisited”

  1. Vishwast, May 19, 2014 at 7:01 am

    Hi
    I want to stream RDBMS data to a message broker like Kafka in real-time mode. Can I accomplish it in some way?

  2. A, June 10, 2014 at 9:15 pm

    Thanks for sharing. Please write more articles, they are of great help.

Yet Another Startup Venture

It has been a while since I published my last blog post. Over the past couple of years, I was busy working with a small team of entrepreneurs on a startup, DwellAware, in the residential real estate space. What we set out to build was a contemporary web application offering objective ratings, derived from a wide spectrum of data sources, for individual real estate properties.

Throughout the course of the startup venture, we maintained a skeletal staff including the no-fear CEO, the product czar, a UX designer, a couple of web-app/backend engineers, a data scientist, and the engineering head (myself). The office was located in the SoMa district of San Francisco. Competing for top talent in the SF Bay is always a challenge, but we were thrilled to have had some of the best talent forming the founding team.

MVP and product-market fit

Our initial focus was to build a minimum viable product (MVP) and go through rapid iterations to achieve product-market fit. To maximize the velocity of our MVP iterations, we started out with a selected region, San Diego County. We listened to user feedback regularly and iterated continuously in accordance with it. The feedback was diligently acquired through interviews with people in local coffee shops, online usability testing, as well as website activity analytics.

Eventually we arrived at a refined release and started to expand geographically from a single county to all of California. Awaiting in the processing queue, ready to be deployed, were a number of states including Florida, Texas, New York and Illinois. The goal was to cover the 120+ million properties nationwide. We scaled the technology infrastructure as we expanded the geography and had all the key technology components in place.

Sadly we couldn’t quite make it to the finish line and had to wind down the operation. In hindsight, perhaps there were mistakes made at both the strategic and tactical levels that led to the disappointing result, and they would warrant some hard analysis. That isn't what this blog post is about, though. For now I would simply like to share some of the technological considerations and decisions made during the course of the venture.

DwellScore and HoodScore

A significant portion of the engineering work lay within the data science domain. In order to create an objective scoring system for individual properties in the nation that factors in hidden-cost (e.g. commute, maintenance) analysis, we exhausted various data sources, from public census databases and open-source projects to commercial data providers, so as to establish a comprehensive data warehouse.

To help real estate agents/brokers promote their listings, we derived badges (e.g. “Safe Neighborhood”, “Low Traffic Street”, “Top Rated High School”) and blended them into listing photos for qualified real estate properties in accordance with the calculated scores. The agents/brokers were free to circulate selected badged photos by resubmitting them to the associated MLSes for ongoing distribution.

One of the challenges was to validate and consolidate incomplete, and sometimes inaccurate, data from the various sources, which were oftentimes incompatible among themselves. Even data acquired under expensive license terms was often found to be erroneous or incomplete. We got to the point where we were going to redefine our own nationwide neighborhood dataset in the next upgrade.

Nevertheless, we were able to come up with our first-generation scores for individual properties (DwellScore) and neighborhoods (HoodScore), backed by some extensive data science work that aggregated sub-scores in the areas of cost analysis, crime rate, school districts, neighborhood lifestyle and economics. Among the sub-scores was a comfort score that included a number of unique ingredients, including noise. To come up with just the noise ranking, we had to comb through data and heat maps related to aircraft, railroad and road traffic counts, all from different sources.

The fact that a number of technology partners were interested in acquiring our data science work at the end of the venture speaks to its quality and comprehensiveness.

NLP & computer vision

Real estate listings have long been known for their lack of completeness and accuracy. There are hundreds of MLSes administered using disparate data management systems, and possibly over a million real estate brokers/agents in the nation. As a result, listings data not only needs to be up-to-date, but should also be systematically validated in order to be trustable.

We experimented with using NLP (natural language processing) to help validate listings data by extracting and interpreting data of interest from the latest free-form text entered by agents. In addition, we worked with a computer vision company to process massive volumes of listing images via pattern recognition and machine learning. Certain characteristics of individual property listings, such as curb appeal, actual living area to lot size ratio, existence of power lines, etc., could be identified through computer vision.

Technology stack

We adopted Node.js as the core tech stack for our web-centric application. Python was used as the backend/data-mining platform for data processing tasks, such as importing real estate listings from MLSes, as well as for data-science number crunching. In addition, we developed data-service APIs for internal consumption using Tornado servers, freeing Node.js from having to handle data processing routines.

MySQL was initially chosen as the database management system for OLTP data storage and data warehousing. While Python has a rich set of libraries for geospatial/GIS (geographic information system) work, which constitutes a significant portion of our core development, on the database front it didn't take long for us to hit the limit of the geospatial capability offered by MySQL's latest stable release. Apparently, PostgreSQL equipped with PostGIS has been the de facto database choice for most geospatialists in recent years. Knowing that a database transition was going to cost us non-trivial effort, it was nonetheless one of those uncompromisable actions we had to take. Switching the database platform was made easy by SQLAlchemy, which provides the ORM (object-relational mapping) abstraction layer on Python.

Geospatial search

The Google Maps API has great features for maps, street views and geocoded address search, but there were still cases where a separate custom search solution could complement the search functionality. PostgreSQL has a trigram module (pg_trgm) which maintains trigram-based indexes over text columns for similarity search. That helps add some crude NLP (natural language processing) capability to the search functionality, necessary for more user-friendly geographical search (e.g. for property addresses).
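To make the pg_trgm idea concrete, here is a sketch of the kind of similarity query it enables, built as a parameterized statement one might hand to a PostgreSQL driver such as node-postgres. The table and column names (`properties`, `address`) are hypothetical, and it assumes the database has run `CREATE EXTENSION pg_trgm;` with a GIN trigram index on the column:

```javascript
// Sketch of a pg_trgm-backed address search. Table/column names are
// hypothetical. Assumes a trigram index exists, e.g.:
//   CREATE INDEX idx_properties_address_trgm
//     ON properties USING gin (address gin_trgm_ops);

function buildAddressSearch(userInput, limit) {
  // The % operator matches rows whose trigram similarity to the input
  // exceeds pg_trgm's similarity threshold (default 0.3), so a typo
  // like "Brodway" can still find "Broadway".
  const text =
    'SELECT id, address, similarity(address, $1) AS score ' +
    'FROM properties ' +
    'WHERE address % $1 ' +
    'ORDER BY score DESC ' +
    'LIMIT $2';
  return { text, values: [userInput, limit] };
}

const query = buildAddressSearch('123 Brodway, San Dieg', 10);
```

The `{ text, values }` shape is the parameterized-query form most Node PostgreSQL drivers accept, which also keeps user input out of the SQL string itself.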

While Postgres' trigram module is a viable tool, it directly taxes the database and could impact performance as the database volume continues to grow. To scale search independently from the database, we picked Elasticsearch. Elasticsearch comes with a comprehensive set of functions for robust text search (partial match, fuzzy match, human language, synonym support, etc.) via its underlying n-gram lexical analyzer. In addition, it also has basic functions for geolocation, supporting complex shapes in GeoJSON format. In brief, Elasticsearch fit well into our search requirements.
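As an illustration of combining those two capabilities, here is a sketch of an Elasticsearch query body mixing a fuzzy text match with a geo-distance filter. The index and field names (`address`, `location`) are hypothetical; the object is standard Elasticsearch query DSL that a client library would POST to the `_search` endpoint:

```javascript
// Sketch of an Elasticsearch query: fuzzy match on an address field
// combined with a geo_distance filter. Field names are hypothetical;
// "location" is assumed to be mapped as a geo_point.

function buildGeoSearch(queryText, lat, lon, radiusKm) {
  return {
    query: {
      bool: {
        must: {
          match: {
            address: {
              query: queryText,
              fuzziness: 'AUTO'   // tolerate small typos via edit distance
            }
          }
        },
        filter: {
          geo_distance: {
            distance: radiusKm + 'km',
            location: { lat: lat, lon: lon }
          }
        }
      }
    }
  };
}

const body = buildGeoSearch('balboa park', 32.73, -117.15, 5);
```

Putting the geo clause in `filter` rather than `must` means it constrains results without affecting relevance scoring, which is usually what a radius search wants.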

Cloud computing platform

We picked Amazon AWS as our hosting and cloud computing platform, so using its CloudFront as the CDN was a logical step. Other readily available AWS services also offer useful tools in various areas. On the operations front, Route 53 is a DNS service with some competitive edge over other existing services out there. For instance, it supports alias records that act like a CNAME for the base (apex) domain name, which many big-name DNS hosting services don't. Amazon's Elastic Load Balancer (ELB) also makes load-balancing setup easy and allows centralized digital certificate setup. With a wildcard digital certificate for a base domain name and a security policy that permits terminating SSL/TLS at the load balancer, secure website setup could be made really simple.

Security-wise, AWS now offers a rather high degree of flexibility for role-based security policies and security group setup. On the database side, Amazon's RDS provides a data persistence storage solution that shields one from having to build and maintain individual relational database servers. When evaluating AWS in a prior startup venture, I had a lot of reservations about its readiness to provide a production-grade infrastructure. I must say that it has improved a great deal since.

A fun run

Although the venture lasted just slightly over two years, it was a fun run. We fostered a culture of transparency and best-ideas-win. We also embraced risk-taking and fast learning on many fronts, including adopting and picking up bleeding-edge technologies not entirely within our comfort zone. Below are a couple of pictures taken on the day the first production application was launched, back in the summer of 2014:

The crowd in the engineering room

Launching the first web site

3 thoughts on “Yet Another Startup Venture”

  1. Pingback: Adopting Node.js In The Core Technology Stack | Genuine Blog

  2. Pingback: PostgreSQL Table Partitioning | Genuine Blog

  3. Pingback: Streaming ETL With Alpakka Kafka | Genuine Blog

Adopting Node.js In The Core Tech Stack

At DwellAware, the startup company I was with recently, I was tasked to build a web-centric application with a backend for comprehensive data analytics in the residential real estate space. Nevertheless, this post is not about the startup venture. It's about Node.js, the technology stack chosen to power the application. Programming platforms considered at the beginning of the venture included Scala/Play, PHP/Laravel, Python/Twisted, Ruby/Sinatra and Javascript/Node.js.

Nor is it a blog post comparing programming platforms. I'm simply going to state that Node.js was picked mainly for a few reasons:

  1. its lean-and-mean minimalist design principle is in line with how I would like to run things in general,
  2. its event-driven, non-blocking-I/O architecture is well suited for contemporary high-concurrency web-centric applications, and,
  3. it keeps the entire web application on a single programming platform, since contemporary client-side features are heavily and ubiquitously implemented in Javascript anyway.

Is adopting Node.js a justifiable risk?

In fact, that was the original title of this blog post. I was going to blog about the research necessary for adopting Node.js as the tech stack for the core web application back in 2013. The draft never grew to more than a few bullet points and was soon buried deep down the priority to-do list.

Javascript has been used on the client side of web applications for a long time. Handling non-blocking events triggered by human activities in a web browser is one thing; dealing with split-second server events and I/O activities on the server side in a non-blocking fashion is a little different. Node.js's underlying event-driven, non-blocking architecture does help somewhat flatten the learning curve for Javascript developers.

Although new Node modules emerged almost daily trying to address just about anything in any problem space one could think of, not many of them proved to be very useful, let alone production-grade. That was two years ago. Admittedly, a lot has changed over the past couple of years and Node has definitely grown more mature every day. By most standards, though, Node.js is still a relatively young technology.

Anyway, let’s rewind back to Fall 2013.

Built on Google's V8 Javascript engine, Node is a Javascript-based server platform designed to efficiently run I/O-intensive server applications. For a long time, Javascript was considered a client-side-only technology. Node.js has made it a serious contender as a server-side technology. The fact that prominent software companies such as Microsoft, eBay and LinkedIn adopted Node.js in some of their products/services was more or less a testimonial. While hype about certain seemingly arbitrary technologies has always been a phenomenon in Silicon Valley, I wouldn't characterize the recent rise of Javascript and Node as mere hype.

Node.js modules

Node by itself is just a barebones server, hence picking suitable modules was one of the upfront tasks. One of the core modules, an essential part of Node's middleware framework, is Connect, which provides chaining of functions and enhances Node's http module. ExpressJS further equips Node with rich web-app features on top of Connect. To take advantage of multi-core/multi-processor server configurations, Node offers the method child_process.fork() for spawning worker processes that are capable of communicating with their parent via built-in IPC (inter-process communication).

On the build-tool front, we started out with Grunt, then later shifted to Gulp, partly for the speed of Gulp's streaming approach. But we were happy with Grunt as well. Express uses Jade as its default templating engine. We didn't like its performance, so we evaluated a couple of alternative templating engines, including doT.js and Swig, and were shocked to see a performance gain of an order of magnitude. We promptly switched to Swig (with doT.js a close second).
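Much of that gap comes from the "compile once, render many" approach engines like doT.js take. The toy below is not either engine's real API, just the underlying idea: turn placeholders into a generated function up front, so each render is plain string concatenation instead of re-parsing the template.

```javascript
// Toy illustration of precompiled templating (NOT the doT.js or Swig API).
// compile() turns "Hello {{user}}" into a function that concatenates
// strings, so rendering never re-parses the template text.

function compile(template) {
  // e.g. 'Hello {{user}}' becomes: return "Hello " + d.user + "";
  const body = 'return ' + JSON.stringify(template)
    .replace(/\{\{(\w+)\}\}/g, '" + d.$1 + "') + ';';
  return new Function('d', body);  // the compiled render function
}

const render = compile('Hello {{user}}, you have {{n}} new listings.');
const html = render({ user: 'Alice', n: 3 });
// html === 'Hello Alice, you have 3 new listings.'
```

A real engine adds escaping, conditionals and loops on top, but the precompilation step is what makes per-render cost so low.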

On the test framework front, we used Mocha.js with the assertion library Chai.js, which supports the BDD (behavior-driven development) assertion style.

Data persistence, caching, content delivery, etc

A key part of our product offering is data intelligence, thus databases for both OLTP and warehousing are critical components of the technology stack. MongoDB has been a default database choice for many Node.js applications, for good reasons. The emerging MEAN (MongoDB-ExpressJS-AngularJS-Node.js) framework hints at the popularity of the Node-MongoDB combo. So Mongo was definitely a considered database. After careful consideration, we decided to go with MySQL. One consideration was that it wouldn't be too hard to hire a DBA/devops engineer with MySQL experience, given its popularity. Both Node.js and MongoDB were relatively new products and we didn't have in-house MongoDB expertise at the time, so taming one beast (Node in this case) at a time was the preferred route.

There weren't many Node-MySQL modules out there, though we managed to adopt a simplistic MySQL module that also provides basic connection pooling. Later on, due to the superior geospatial functionality of PostGIS available in the PostgreSQL ecosystem, we migrated from MySQL to PostgreSQL. Thanks to the vast Node.js module repository, Node-PostgreSQL modules were readily available for connection pooling. To cache frequently referenced application data, we used Redis as a centralized cache store.
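For readers unfamiliar with what such pooling modules do, here is a deliberately minimal sketch of the idea (not any particular module's API): cap the number of live connections, reuse released ones, and hand out new ones only up to the cap. Real pools create connections asynchronously and queue callers when exhausted; this toy version just throws.

```javascript
// Minimal connection-pool sketch. Real driver pools (e.g. node-postgres)
// connect asynchronously and queue waiters; this version only shows the
// cap-and-reuse mechanics.
class SimplePool {
  constructor(factory, max) {
    this.factory = factory;   // () => a new connection object
    this.max = max;           // hard cap on concurrent connections
    this.idle = [];           // released connections ready for reuse
    this.inUse = 0;
  }
  acquire() {
    if (this.idle.length > 0) {
      this.inUse++;
      return this.idle.pop();             // reuse before creating
    }
    if (this.inUse >= this.max) {
      throw new Error('pool exhausted');  // a real pool would queue here
    }
    this.inUse++;
    return this.factory();
  }
  release(conn) {
    this.inUse--;
    this.idle.push(conn);
  }
}

// Usage with a stubbed connection factory:
let created = 0;
const pool = new SimplePool(() => ({ id: ++created }), 2);
const a = pool.acquire();  // creates connection 1
const b = pool.acquire();  // creates connection 2
pool.release(a);
const c = pool.acquire();  // reuses connection 1; nothing new created
```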

Besides dynamic content rendered by the application, our web presence also included a lot of static content of various types, including images and certain client-side application data. To serve static web content, a few typical approaches were reviewed, including using a proxy web server and a content delivery network (CDN). On the proxy-server front, Nginx has been on the rise, overtaking Apache to become the most popular HTTP server; its minimalist design is kind of like Node's. We did some load-testing of static content on Node, which proved to be a rather efficient static content server. We decided a proxy server wasn't necessary, at least in the immediate term. As for a CDN, we used Amazon's CloudFront.

Score calculation & image processing

Part of the core value proposition of the product was to come up with objective scores for individual residential real estate properties and neighborhoods, so as to help users make intelligent choices in buying/selling their homes. As described in a previous blog post, a lot of data science work in a wide spectrum of areas (cost analysis, crime, schools, comfort, noise, etc.) was performed to generate the scores.

Based on the computed scores, we then derived badges for qualified real estate properties in different areas (e.g. “Low Energy Bills”, “Safe Neighborhood”, “Top Rated Elementary School”). The badges were embedded in selected photos of individual real estate properties, which could then be fed back into the listings distribution cycle by resubmitting into the associated MLSes if the real estate agents/brokers chose to.

All the necessary score calculation and image processing for the badges was done in the backend on a Python platform with PostgreSQL databases. Python Tornado servers were used as the data-service API, along with basic caching, for Node.js to consume the data as presentation content.

Here’s a screen-shot of the Dwelling Page for a given real estate property, showing its DwellScore:

DwellAware DwellScore

Geospatial maps & search

For geographical maps and search, the Google Maps API was used extensively from within Node.js. As part of the backend data processing routine, we geocoded all real estate property addresses in advance using the API, so as to take advantage of Google's superior search capability.

To supplement the already pretty robust Google Maps search from within Node.js and to better utilize our own geospatial data content, we experimented with an Elasticsearch module, which comes with an n-gram lexical analyzer for fuzzy-match search. The test results were promising. An advantage of using such an autonomous search system is that it doesn't directly tax the Node.js server or the PostgreSQL database (e.g. pg_trgm) as traffic load increases.

Below is a screen-shot of the Search Page centering around San Diego:

DwellAware Search

Fast-forward to the present

As mentioned earlier, Node.js has evolved quite a bit over the past couple of years: the rather significant feature/performance improvements from v0.10 to v0.12, the next LTS (long-term support) release incorporating the latest V8 Javascript engine and ES6 ECMA features, the fork-off to io.js, which later merged back into Node, and so on. It all sounds promising and exciting.

In conclusion, given the evident progress of Node's development, I'd say it's now hardly a risk to adopt Node for building general web-centric applications, provided that your engineering team possesses sufficiently strong Javascript skills. It wasn't a difficult decision for me two years ago to pick Node as the core technology stack, and it would be an even easier one today.

For more screen-shots of the website, click here.

3 thoughts on “Adopting Node.js In The Core Tech Stack”

  1. Pingback: Yet Another Startup Venture | Genuine Blog

  2. Pingback: Database CRUD in Scala-Play | Genuine Blog

  3. Pingback: Self-contained Node.js Deployment | Genuine Blog

An Android Board Game: Sweet Spots

Back in 2013, one of my planned to-do items was to explore programming on the Android platform. As always, the best way to learn a programming platform is to write code on it. At the time I was hooked on an interesting board game called Alberi (based on a mathematical puzzle designed by Giorgio Dendi), so I decided to develop a board game using the very same game logic as a programming exercise. I was going to post it in a blog but got buried in a different project upon finishing the code.

This was the first Android application I ever developed. It turned out to be a little more complex than initially thought for a first programming exercise on a platform unfamiliar to me. Nevertheless, it was fun to develop a game I enjoyed playing. Android is Java-based, so for me the learning curve was not that steep, and the Android SDK comes with a lot of sample code that can be borrowed.

The game is pretty simple. For a given game, the square-shaped board is composed of N rows x N columns of square cells. The entire board is also divided into N contiguous colored zones. The goal is to distribute a number of treasure chests over the board according to the following rules:

  1. Each row must have 1 treasure chest
  2. Each column must have 1 treasure chest
  3. Each zone must have 1 treasure chest
  4. Treasure chests cannot be adjacent to each other row-wise, column-wise or diagonally
  5. There is also a variant with 2 treasure chests (per row/column/zone) at larger board sizes
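The rules above translate almost directly into a validation routine. The sketch below (in JavaScript for brevity, not the app's actual Java code) covers the single-chest variant: `chests` is a list of `[row, col]` positions and `zones` is an N x N grid of zone ids.

```javascript
// Check rules 1-4 for the single-chest variant. chests: array of
// [row, col]; zones: N x N grid of zone ids (one id per colored zone).
function isSolved(chests, zones) {
  const n = zones.length;
  if (chests.length !== n) return false;
  const rows = new Set(), cols = new Set(), zs = new Set();
  for (const [r, c] of chests) {
    rows.add(r);
    cols.add(c);
    zs.add(zones[r][c]);
  }
  // Rules 1-3: exactly one chest per row, per column and per zone.
  if (rows.size !== n || cols.size !== n || zs.size !== n) return false;
  // Rule 4: no two chests adjacent, including diagonally.
  for (let i = 0; i < chests.length; i++) {
    for (let j = i + 1; j < chests.length; j++) {
      if (Math.abs(chests[i][0] - chests[j][0]) <= 1 &&
          Math.abs(chests[i][1] - chests[j][1]) <= 1) return false;
    }
  }
  return true;
}

// Example: a 4x4 board whose zones are vertical stripes (zone id = column).
const exampleZones = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]];
// isSolved([[0, 1], [1, 3], [2, 0], [3, 2]], exampleZones) is a valid layout.
```

The two-chest variant would relax the Set-based counts to "exactly 2 per row/column/zone" while keeping the same adjacency check.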

Here’s a screen-shot of the board game (N = 6):

SweetSpots screenshot

Publishing the game app on Google Play

Back then I didn't publish the game on Google Play. I've decided to do it now just to try out the process. To do that, I needed to create a signed Android application package (APK), then zip-align it as per Google's publishing requirements. Eclipse and the Android SDK, along with the Android Debug Bridge (adb) for the Android device emulator, were used for developing the app back in 2013. The Android OS version at the time was Jelly Bean, although the game still plays fine today on my Lollipop Android phone. The Eclipse version used for development was Juno, and the Android SDK version was 17.0.0.

Just two years later, while the game app still runs fine as an unsigned APK on the current Android platform, it no longer builds properly on the latest Eclipse (Mars) and Android SDK (v24.0.2), giving the infamous "R cannot be resolved" error. There are lots of suggestions out there on how to solve the problem, such as fixing the resource XML, modifying the build path, etc., but unfortunately none applied.

As Google is ending support for the Android Developer Tools (ADT) in Eclipse literally by the end of the month, leaving the IntelliJ-based Android Studio as the de facto IDE for future Android app development, I thought I would give it a shot. Android Studio appears to be a great IDE product, and importing the Eclipse project into it was effortless. It even nicely correlates dependencies and organizes multiple related projects into one. But then a stubborn adb connection problem blocked me from moving forward, so I decided to move back to Eclipse. Finally, after experimenting with mixing Android SDK build tools and platform tools of older versions, I managed to build the app successfully. Here's the published game on Google Play.

From Tic-tac-toe to SweetSpots

The Android SDK version I used comes with a bunch of sample applications along with working code. Among the applications is a Tic-tac-toe game, which I decided would serve well as the codebase for the board game. I gave the game the name Sweet Spots.

Following the code structure of the Tic-tac-toe sample application, there are two inter-dependent projects for Sweet Spots: SweetSpotsMain and SweetSpotsLib, each with its own manifest file (AndroidManifest.xml). The file system structure of the source code and resource files is simple:

MainActivity (extends Activity)

The main Java class in SweetSpotsMain is MainActivity, whose onCreate() method sets up buttons for games of various board sizes. In the original code repurposed from the Tic-tac-toe app, each of the game buttons used its own onClickListener defining onClick(), which in turn called startGame() to launch GameActivity via startActivity() with an Intent object referencing the GameActivity class. This has been refactored so that the activity class implements onClickListener itself, overriding onCreate() to call setOnClickListener(this) and onClick() to dispatch the specific actions for individual buttons.

View source code of MainActivity.java in a separate browser tab.

GameActivity (extends Activity)

One of the main Java classes in SweetSpotsLib is GameActivity. It defines a few standard lifecycle callbacks including onCreate(), onPause() and onResume(). GameActivity also implements a number of OnClickListeners for operational buttons such as Save, Restore and Confirm. The Save and Restore buttons temporarily save the current game state so it can be restored, say, after trying a few tentative moves. Clicking the Confirm button initiates validation of the game rules.

View source code of GameActivity.java in a separate browser tab.

GameView (extends View)

The other main Java class in SweetSpotsLib is GameView, which defines and maintains the view of the board game in accordance with the activities. It implements much of the game logic within standard callback methods including onDraw(), onMeasure(), onSizeChanged(), onTouchEvent(), onSaveInstanceState() and onRestoreInstanceState().

GameView also contains the interface ICellListener, whose abstract method onCellSelected() is implemented in GameActivity. The method does nothing in GameActivity but could be extended with control logic if wanted.

View source code of GameView.java in a separate browser tab.

Resource files

Images and layout (portrait/landscape) are stored under the res/ subdirectory. Much of the key parametric data (e.g. board size) is also stored there, in res/values/strings.xml. Since this was primarily a programming exercise on a mobile platform, visual design/UI wasn’t given much effort. Images used in the board game were assembled using GIMP from public-domain sources.

Complete source code for the Android board game is at: https://github.com/oel/sweetspots

How were the games created?

These games were created using a separate Java application that, for each game, automatically generates random zones on the board and validates the game via trial-and-error for a solution. I’ll talk about that application in a separate blog post when I find time. One caveat about automatic game creation is that the solution is generally not unique. A game with a unique solution would allow some interesting game-playing logic to be more effective in solving the game. One way to create a unique solution would be to manually re-shape the zones in the generated solution.

Solving The Sweet Spots Board Game

Creating the Android board game, Sweet Spots, was a fun programming exercise, although developing an algorithmic program to solve the board game is in fact more fun.

A good portion of developing a board game is about validating the game state based on the game rules, creating the game control logic, and crafting the visual look-and-feel. Much of that is just mechanical programming exercise. On the other hand, solving the game is a vastly different exercise, requiring some algorithmic programming effort.

Sweet Spots under the hood

The game-solving application was written in Java, with Ant as the build tool. There are only two simple Java classes that constitute the hierarchical structure of the board game: Spot and Board. Another Java class, SweetSpots, contains the core game-solving mechanism. Source code is available at GitHub.

Spot defines the (x,y)-coordinate position of each of the NxN spots (i.e. cells) on the board of size N. It also has an integer value that represents:

  • empty (i.e. undecided spot)
  • filler (i.e. spot not chosen for treasure chest)
  • target (i.e. spot chosen for a treasure chest)

In addition, it carries an integer zone-id (0, 1, .., N-1) that identifies which of the N zones the spot belongs to.

Board defines the board size (N). The existing board game was developed with either 1 or 2 targets (i.e. treasure chests) per row/column/zone for a given game. For generality, the Board class keeps separate targets-per-row, targets-per-column and targets-per-zone counts, although they would all be the same (i.e. 1 or 2) when applied to the existing game version. It was initially generalized to allow a rectangular board dimension (i.e. MxN instead of NxN), but was later simplified to a square board.

Board also contains a variable, spots, of type Spot[][], that maintains the game state during a game, plus a number of methods for game-rule validation. Besides the standard constructor that takes board-size and target-count parameters, it also has a constructor for cloning itself to keep a snapshot of the board state.

Class SweetSpots embeds the game-solving logic. It takes as input a board-zone file, along with the target counts, that defines a game, and contains the methods necessary to solve it. The board-zone file is a text file containing the zone information of a given game in the form of a matrix, with coordinate (0,0) representing the top left-most entry. For instance, a 4×4 board-zone file might have the following content:
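The original file contents didn’t survive in this copy of the post; a plausible 4×4 zone layout (hypothetical, not the original data) would be:

```
0 0 1 1
0 2 2 1
3 3 2 1
3 3 2 2
```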

The matrix of integers represents the 4 zones, each identified by an integer (0-3).

Class SweetSpots contains a variable, boardStack, a stack backed by a LinkedList. The stack maintains a dynamic list of Board instance snapshots saved at various stages of the trial-and-error routine. The trial-and-error process is performed using two key methods, boardwalk() and rollback(). Method boardwalk() walks through each of the NxN spots on the board (hence “board walk”) in accordance with the game rules. Upon failing any of the game-rule validations, rollback() handles rolling back to the previous game-solving state recorded in boardStack.

Below are pseudo-code logic for methods boardwalk() and rollback().
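The original listings didn’t survive in this copy; reconstructed from the description above (so the details are approximate), the logic is roughly:

```
boardwalk():
  for each spot (x, y) on the board, scanning row by row:
    if the spot is empty:
      tentatively mark it as a target
      if the placement passes row/column/zone target-count validation:
        push a clone of the current Board onto boardStack
      else:
        mark the spot as a filler instead
  if every row, column and zone has its full target count: solution found
  otherwise: rollback()

rollback():
  if boardStack is empty: report the game as unsolvable
  pop the most recent Board snapshot off boardStack and restore it
  re-mark the target spot last tried as a filler
  resume boardwalk() from that spot
```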

Solving a game

Class SolveGame is a simple executable module that uses the SweetSpots class to solve a game with defined zone data.

The main flow logic boils down to the following:
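The flow listing itself isn’t preserved here; in outline (reconstructed from the surrounding description, not the original listing):

```
read the board-zone file and the target counts
build a Board and a SweetSpots instance from them
repeat boardwalk(), falling back to rollback() on rule violations,
  until a solution is found or boardStack is exhausted
print the solved board (targets vs fillers), or report failure
```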

To solve a game with defined zones, simply navigate to the main subdirectory of the java app and run the following command:

For instance, to solve a game defined in ./gamefiles/game-4-1.txt, simply navigate to the main subdirectory of the java app and run the following command from /path/to/sweetspot/java/:

Creating a game

Class CreateGame is an executable module that creates a game by generating random zone data and guaranteeing a solution via repetitive method trials in class SweetSpots.

Creating a game for a given board size (i.e. N) and target count involves:

  • generating N random contiguous zones that partition the board, and,
  • ensuring there is a solution for the generated zones

Getting slightly more granular, it involves the following steps:

  1. Assign individual zones random sizes to fill the entire board: To reduce frequency of having extreme zone size, a simplistic weighted random probability distribution, triangularProbDist(), is used for determining the size for individual zones.
  2. For each zone, assign random contiguous spots of the assigned size on the board: Within class CreateGame is a method, zoneWalk(), which essentially “walks” row-wise or column-wise randomly until the assigned zone size is reached. Failure at any point to proceed further toward covering the entire board with the zones promptly forces a return to step #1.
  3. Transpose the board to further randomize the zone-walk result.
  4. Repeat the above steps till the zones of assigned sizes successfully fill the entire board.
  5. Ensure that there is a solution for the created zones: This is achieved by essentially employing the same game solving logic used in SolveGame.

To create a game that consists of a solution, navigate to the main subdirectory of the java app and execute the following command:

To create a game with 4×4 board size, go to /path/to/sweetspot/java/ and run the following command to generate the game zones in ./gamefiles/:

The generated game-zone file should look something like the following:

The above Java applications were developed back in the summer of 2013. Though some refactoring effort has been made, there is certainly room for improvement in different areas. In particular, the zone-creation piece can be designed to be more robust, ideally enforcing a unique solution for the game. That would be something for a different time, perhaps in the near future. Meanwhile, enjoy the board game, which can be downloaded from Google Play.

Database CRUD In Scala-Play

In a recent startup venture, I adopted Node.js and Python, both dynamic-typing programming platforms, for the core technology stack of a data-centric web application. While I like both platforms for what they’re inherently good at, I still have a biased preference towards static-typing languages. Scala was on my list of considered platforms before I eventually settled on Node.js. I wasn’t going to just forget about what I liked about Scala, though.

Coming from a Math background, I have high regard for the benefits of functional programming. To me, Java has always been a great object-oriented programming (OOP) language for general-purpose application development. Although functional programming (FP) features have been added since Java 8, they aren’t really an integral part of the language core. For that reason, I’m not going to start my pursuit of a static-typing programming platform that embodies both OOP and FP with the Java platform.

Scala, Play and Reactive programming

Scala’s static typing, plus the blending of object-oriented and functional programming in its language design, makes it an attractive programming language. The Play MVC framework, which comes with handy features like REST support and asynchronous I/O for building web applications in Scala, has also picked up some momentum over the past few years. So, for general-purpose web-based application development, the Scala-Play combo sounds appealing for what I’m looking for.

Typesafe (being renamed to Lightbend as we speak) is a company founded by the authors of Scala and Akka. Advocating the Reactive programming paradigm, it provides a development tool, Activator, along with application templates to help streamline the development process, with emphasis on scalability, performance, resilience and non-blocking message-driven communication. The browser-based tool and reusable templates help make adopting the framework easier.

Data access layer: Anorm vs Squeryl vs Slick

There are a few libraries/APIs for the data access layer in the Scala-Play ecosystem. Anorm provides functionality for parsing and transforming query results from embedded plain SQL. Squeryl is an ORM (object-relational mapping) library that supports composition of queries and explicit data-retrieval strategies. Slick is an FRM (functional-relational mapping) library that lets you handle data-access operations much like working with Scala collections. After some quick review, I decided to try out Anorm.

As always, the best way to get familiar with a programming platform is writing code. My hello-world application is a web application that handles database CRUD (create/read/update/delete) operations along with query pagination. Typesafe’s templates come in handy and help tremendously in providing the base code structure in various categories. There are templates with the data access layer using each of the three libraries, and one already comes with sample code for basic database operations and query pagination. That prompted me to bypass building the very basic stuff and instead focus on enhancements for expansion and maintainability.

Creating a project from Typesafe’s templates is straightforward. To run the Reactive development platform, simply launch its web server at the command line under the project root with the following command (or just double-click the activator executable from within a file browser):

./activator ui

Activator’s UI will be fired up in a web browser. You can compile and run the application via the UI and check out the launched application at http://localhost:9000/.

Looking under the hood

Based on the few templates I’ve downloaded, below is what a typical project root in the file system would look like:
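The directory listing is missing from this copy of the post; a typical Play project root (a sketch from such templates, so details may vary) looks roughly like:

```
app/        - models, views and controllers
conf/       - application.conf, routes, evolutions/
logs/
project/    - sbt build definitions
public/     - javascripts, stylesheets, images
test/
activator   - the Activator launcher
build.sbt
```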

Like many other web-based MVC frameworks, much of the file system structure (app/, logs/, public/, test/) is pretty self-explanatory. The *.sbt files contain project-specific build scripts and package dependencies. Routing is configured within file conf/routes. The conf/evolutions/ subdirectory is for tracking database evolution scripts.

Despite my limited experience with Scala, the overall code included in the template is pretty easy to grasp. It’s self-contained and equipped with sample data, scripts for the data schema, Scala modules for models, views and controllers, and even the jQuery and Bootstrap libraries. After getting a good understanding of the skeletal code in the framework, I came up with a list of high-level changes to be made:

  1. Expand the schema with multiple relational entities
  2. Apply Anorm’s parsing, filtering and query pagination to multiple views
  3. Modularize controller code into multiple controllers
  4. Add a navigation bar with some simple Bootstrap UI enhancement

All of the above are pretty straightforward. A simple 3-table relational data model (song –N:1– musician –N:1– country) is created. SQL scripts for populating the tables are created under conf/evolutions/default/*.sql to make use of Play’s database-evolution scripting mechanism. Forms are created for creating and editing songs and musicians. Filtering and query pagination are applied to the song and musician lists. Multiple controllers are created for modularity and maintainability.
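Such a schema boils down to a few case classes on the model side (a sketch; the field names here are illustrative, not necessarily those in the project):

```scala
// song –N:1– musician –N:1– country
case class Country(id: Long, name: String)
case class Musician(id: Long, name: String, countryId: Long)
case class Song(id: Long, title: String, musicianId: Option[Long])
```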

SQL statements in Anorm

Plain SQL statements can be directly embedded into Play’s model. For instance, the update() function in app/models/Song.scala for updating the song table is as simple as follows:
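The snippet itself didn’t survive in this copy; a minimal sketch of what such an update() looks like with Anorm of that era (table and column names are assumptions) is:

```scala
import anorm._
import play.api.db.DB
import play.api.Play.current

// plain SQL embedded in the model, with named placeholders bound via on()
def update(id: Long, song: Song): Int = DB.withConnection { implicit c =>
  SQL("""
    update song set title = {title}, musician_id = {musicianId}
    where id = {id}
  """).on(
    'title -> song.title,
    'musicianId -> song.musicianId,
    'id -> id
  ).executeUpdate()
}
```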

Anorm SqlParser

Anorm appears to be a lean-and-mean library that allows developers to directly embed SQL statements in the application. At the same time, it provides flexible methods for parsing and transforming query results. One useful feature is its parser combinators, which allow you to chain column parsers into a row parser. For example, the following snippet in app/models/Song.scala shows the use of the sequential parser combinator (~) to parse the result set of a query on table “song”:
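The snippet didn’t survive extraction; the parser follows this general shape (column names here are assumptions):

```scala
import anorm._
import anorm.SqlParser._

// chain column parsers with ~, then map the composed result to a Song
val simple: RowParser[Song] = {
  get[Long]("song.id") ~
  get[String]("song.title") ~
  get[Option[Long]]("song.musician_id") map {
    case id ~ title ~ musicianId => Song(id, title, musicianId)
  }
}
```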

Both get[T] and the parser combinator (~) are methods defined within SqlParser as part of Anorm’s API.

Query pagination

Pagination is done in an elegant fashion by means of a helper case class:
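The helper isn’t shown in this copy of the post; in Play’s classic computer-database sample, which this template closely resembles, it is essentially:

```scala
// a page of query results plus enough metadata to render prev/next links
case class Page[A](items: Seq[A], page: Int, offset: Long, total: Long) {
  lazy val prev: Option[Int] = Option(page - 1).filter(_ >= 0)
  lazy val next: Option[Int] = Option(page + 1).filter(_ => (offset + items.size) < total)
}
```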

The list() function in app/models/Song.scala is then defined with type Page[(Song, Option[Musician])].

And the song-list view, app/views/listSongs.scala.html, displays the song page information passed as currentPage, of type Page[(Song, Option[Musician])].

Passing request header as Implicit argument

A navigation bar is added to the main HTML template, app/views/main.scala.html. To highlight the active <li> items on the nav-bar, Bootstrap’s active class is used. But since the HTML pages are all rendered server-side in this application, the request header needs to be passed from the controllers to the associated views (forms, list pages, etc.), which in turn pass the header to the main HTML template where the nav-bar is defined. In Scala-Play, this can be done effectively by passing the request header as a trailing implicit argument (which minimizes breaking any existing code). For instance, the argument list within app/views/editSong.scala.html will be as follows:
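The parameter list isn’t preserved here; the shape of the change (the form parameters are illustrative) is to append the implicit request to the Twirl template’s signature:

```scala
@(id: Long, songForm: Form[Song], musicians: Seq[(String, String)])(implicit request: RequestHeader)
```

Controllers rendering the view then only need an implicit request in scope (e.g. an Action { implicit request => ... } block) for it to be threaded through to the main template.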

The request header passed down to the main HTML template is then consumed there to determine which nav-bar item is active.

Complete source code of the Scala-Play project is at GitHub.

My thoughts on Scala

Scala runs on JVM (Java virtual machine) and supports standard Java libraries. Based on some quick review of Scala I did a couple of years ago, if you want to quickly pick up the language’s basics, consider Scala School. For a more comprehensive introduction of Scala, try Programming in Scala.

Unless you’re already familiar with a functional programming language like Haskell or Erlang, Scala isn’t the easiest language to pick up, and reading Scala code takes some getting used to. But it’s the apparent inconsistency in certain aspects of the language that makes it hard to learn. For instance, Scala seems to support semicolon inference in an inconsistent fashion. Here’s an example:
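A multi-line for-comprehension illustrates the point:

```scala
// fails to compile: no semicolons are inferred inside parentheses,
// so the two generators would need an explicit ';' between them
// val pairs = for (i <- 1 to 3
//                  j <- 1 to 3) yield (i, j)

// compiles: newlines act as separators inside a braced block
val pairs = for {
  i <- 1 to 3
  j <- 1 to 3
} yield (i, j)
```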

The above snippet will fail because the compiler won’t infer semicolons within for(…). Switching to for{…} will fix the problem as semicolons are inferred in {} blocks.

There is also some inconsistency in whether type inference is supported within an argument list versus across curried argument lists, as illustrated in this blog post by Paul Chiusano.

One must also make room in memory to memorize loads of symbols for various meanings in the language, as Scala loves smileys. A few examples below:
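The original list of examples is missing from this copy; a few of the symbolic operators a newcomer must absorb (far from exhaustive):

```scala
// ::  prepends an element to a List; ::: concatenates two Lists
val xs = 1 :: 2 :: Nil
val ys = xs ::: List(3)

// =>  introduces a function literal; _  is a positional placeholder
val double: Int => Int = _ * 2

// ->  builds a two-element tuple, commonly used for Map entries
val capitals = Map("France" -> "Paris")

// <-  is the generator arrow in for-comprehensions
val doubled = for (x <- ys) yield double(x)
```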

Conclusion

Nevertheless, I like Scala as an OOP + FP platform by design. I also like how the Play framework, along with Typesafe’s Activator, provides a Reactive application-development platform and streamlines the development process for contemporary web-based applications.

Many other programming-language platforms (Java, Ruby, etc.) support OOP + FP to various extents, but Scala is still one of the few static-typing platforms providing a solid hybrid of OOP and FP. In addition, running on the JVM and supporting Java libraries is a really big plus for Scala.

Having various choices of libraries/APIs in the data access layer allows engineers to pick what’s best for their needs. If you want an ORM, go for Squeryl; functional all the way, use Slick; embedded plain SQL with versatile parsing/filtering functions, Anorm it is.

Then, there is Akka’s Actor-based concurrency model included in Scala’s standard library. Actors are computational primitives that encapsulate state and behavior, and communicate via asynchronous message passing. The simple yet robust Actor model equipped with Scala’s OOP + FP programming semantics creates a powerful tool for building scalable distributed systems.

Scala Distributed Systems With Akka

A recent R&D project prompted me to look into a programming platform for a distributed system. Storm coupled with Kafka popped up as a candidate, but as streaming wasn’t part of the system requirement, I decided to try out an actor-based system. Between Java and Scala, I had in mind Scala as the programming language primarily for its good mix of object-oriented and functional programming styles, as mentioned in a previous post about Scala-Play.

Naturally, Scala-Akka became a prime candidate. During the technology-evaluation phase, I came across a couple of sample applications from Lightbend (formerly Typesafe) that I think are great tutorials for getting into Scala + Akka. The techniques illustrated for performance and scalability in some of the more comprehensive applications are particularly useful. Although the Play framework serves as a great application-development platform, it’s of less interest here from a product-functionality perspective.

Akka Actor Systems

Akka is an open-source actor library/toolkit targeted for building scalable concurrent and distributed applications in Scala or Java. At the core of Akka are lightweight computational primitives called actors, each of which encapsulates state and behavior, and communicates via asynchronous immutable message passing.

It’s important to note that keeping messages immutable and communication non-blocking are among the fundamental best practices the Actor model is designed around. In other words, they aren’t enforced by the model itself, so it’s the developer’s responsibility to embrace them. The underlying shared-nothing principle of actors, with message passing as the only means of interaction, makes it an appealing concurrency model, as opposed to managing shared memory with locks in conventional thread-based concurrency.

Sample app #1: Pi approximation

To quickly grasp how to use Akka actors to solve computational problems, it might be worth taking a look at a sample application for approximating the value of pi (i.e. π). Although the sample application consists of deprecated code, I still find it a nice example for understanding how to craft the computational components as actors and coordinate partial-result passing via messages among the actors.

Given the dated code, one might want to just skim through the source code of the application while skipping the syntactic details. It shows how easy it is to formulate a distributed computation scheme by making the computation workers (Worker), aggregator (Master) and output listener (Listener) as actors, each playing different roles. A couple of notes:

  1. In general, “tell” (i.e. fire-and-forget) is preferred to “ask” in sending messages for performance reasons. That makes sense in this application: it’s an approximation task, so the rare failure of a worker isn’t the end of the world.
  2. Instead of having all actors defined in a single source file as in this example, actors are often defined separately in a slightly more complex application. It’s a common practice that actor classes are defined using companion objects in Scala. For instance, the Worker actor would be something like the following:
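Recast in the companion-object style, it would look roughly like the sketch below (based on the Pi tutorial’s worker; the message names follow that example):

```scala
import akka.actor.{Actor, Props}

object Worker {
  // messages the worker understands, defined in the companion object
  case class Work(start: Int, nrOfElements: Int)
  case class Result(value: Double)
  def props: Props = Props(new Worker)
}

class Worker extends Actor {
  import Worker._

  // partial sum of the Leibniz series for pi
  private def calculatePiFor(start: Int, nrOfElements: Int): Double =
    (start until (start + nrOfElements))
      .map(i => 4.0 * (1 - (i % 2) * 2) / (2 * i + 1))
      .sum

  def receive = {
    case Work(start, n) => sender() ! Result(calculatePiFor(start, n))
  }
}
```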

Sample app #2: Reactive maps

Lightbend provides a functionality-rich sample application, reactive maps, that illustrates a number of features centered around an actor system, including:

  • GPS using HTML5 geolocation
  • Bot simulation on geospatial maps
  • Play’s handling of WebSockets with actors
  • Actor dependency injection
  • Akka’s peer-to-peer cluster
  • Distributed publish/subscribe in cluster
  • Akka persistence and journal
  • Akka cluster sharding
  • Reactive application deployment

Like most of their sample application templates, reactive-maps comes with a tutorial that walks through the application. What I like about this one is that it starts with a more barebone working version and proceeds to enhance it with more robust features. In the second half of the walk-thru, a new feature for tracking user travel distance is created from scratch and then rewritten to address scalability issues, by means of an improved design of the associated actor as well as the use of Akka persistence/journal and cluster sharding.

Due to the rather wide range of features involved in the application, it might take some effort to go over the whole walk-thru. Nevertheless, I think it’s a worthy exercise to pick up some neat techniques in building a real-world application using Scala and Akka.

Deprecated Akka persistence interface

The source code seems to be rather up to date, although the deprecated Akka persistence interface EventsourcedProcessor does generate some compiler warnings. To fix it, use the trait PersistentActor instead and override the now-abstract persistenceId method. The relevant code after the fix should be as follows:

/app/backend/UserMetaData.scala:
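A sketch of the fix is shown below; the persistenceId value and the handler bodies are placeholders, while the template’s own event/command logic stays as-is:

```scala
import akka.persistence.PersistentActor

class UserMetaData extends PersistentActor {
  // persistenceId is now abstract in PersistentActor and must be overridden
  override def persistenceId: String = self.path.name

  // was the recovery handler under EventsourcedProcessor
  override def receiveRecover: Receive = {
    case _ => () // replay persisted events into actor state, unchanged
  }

  // was the command handler; keep calling persist(event)(handler) as before
  override def receiveCommand: Receive = {
    case _ => () // handle commands, persisting events as before
  }
}
```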

Issue with dependency injection and cluster sharding

There is a bug caught during compilation that arises from the binding for UserMetaData actors in the Play module, Actors.scala, responsible for initializing actors for the web frontend. Dependency injection is used in the module to bind actors of backend role that need to be created from the backend. The cluster-sharded UserMetaData actors now need to be created with the ClusterSharding extension hence requiring a special binding. This new binding causes an exception as follows:

It can be fixed by moving the ClusterSharding related code from class BackendActors into class UserMetaDataProvider, as follows:

/app/actors/Actors.scala:
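A sketch of the provider change follows; the package and shard-name details are assumptions based on the walk-thru (in Akka 2.3, ClusterSharding lives in akka-contrib):

```scala
import javax.inject.{Inject, Provider}
import akka.actor.{ActorRef, ActorSystem}
import akka.contrib.pattern.ClusterSharding

class UserMetaDataProvider @Inject() (system: ActorSystem) extends Provider[ActorRef] {
  // resolve the shard region lazily here, instead of inside BackendActors
  lazy val get: ActorRef = ClusterSharding(system).shardRegion(UserMetaData.shardName)
}
```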

Akka persistence journal dependency issue

Near the end of the example walk-thru is a custom journal setup using the key-value store LevelDB. However, the setup fails at run-time with errors as follows:

Some relevant bug reports in the Akka community suggest that it’s a problem with LevelDB’s dependency on a dated version of Google Guava. Since the default journal seems to work fine and the custom setup isn’t meant for a production-grade journal anyway, I’m going to skip it. Source code with the above changes can be found at GitHub. In a production environment, one would probably want to use Redis, PostgreSQL, HBase, etc., for the persistence journal.

Below is a screen-shot of the final version of the reactive-maps application.

Scala Akka Reactive Maps

Final thoughts

Despite the described glitches, Lightbend’s reactive-maps application is a well-thought-out tutorial, methodically stepping through the thought process from design to implementation, along with helpful remarks related to real-world performance and scalability. Even though the sample application is primarily for illustration, it’s no trivial hello-world and a lot of demonstrated techniques could be borrowed or repurposed in a production-grade actor system.

As mentioned earlier in the post, I think the Akka actor model, with its shared-nothing principle and non-blocking message-passing communication, is a great alternative to the thread-based shared-memory concurrency model, in which deadlocks and expensive context switching can be dreadful to deal with. Building a well-designed actor-based application, however, requires proper architectural work and disciplined best practices to componentize tasks into lightweight actors that interact by means of immutable message passing.

On scalability, powerful features like actor cluster sharding and distributed publish/subscribe allow one to build actor systems that scale horizontally. And last but not least, Scala’s deep roots in both object-oriented and functional programming make it an effective tool for the coding task.


Internet-of-Things And Akka Actors

IoT (Internet of Things) has recently been one of the most popular buzzwords. Despite being over-hyped, we’re indeed heading towards a foreseeable world in which all sorts of things are inter-connected. Before IoT became a hot acronym, I spent five years heavily involved in building a Home-Area-Network SaaS platform at a previous startup I cofounded, so the space is no stranger to me.

At the low-level device network layer, there used to be platform service companies providing gateway hardware along with proprietary APIs for IoT devices running on sensor network protocols (such as ZigBee, Z-Wave). The landscape has been evolving over the past couple of years. As more and more companies begin to throw their weight behind building products in the IoT ecosystem, open standards for device connectivity emerge. One of them is MQTT (Message Queue Telemetry Transport).

Message Queue Telemetry Transport

MQTT had been relatively little-known until it was standardized at OASIS a couple of years ago. The lightweight publish-subscribe messaging protocol, MQTT, has since been increasingly adopted by major players, including Amazon, as the underlying connectivity protocols for IoT devices. It’s TCP/IP based but its variant, MQTT-SN (MQTT for Sensor Networks), covers sensor network communication protocols such as ZigBee. There are also quite a few MQTT message brokers, including HiveMQ, Mosquitto and RabbitMQ.

IoT makes a great use case for Akka actor systems, which come with lightweight, loosely coupled actors in decentralized clusters and robust routing, sharding and pub-sub features, as mentioned in a previous blog post. The actor model can rather easily be structured to emulate the operations of a typical IoT network that scales in device volume. In addition, the availability of MQTT clients for Akka, such as Paho-Akka, makes it easy to communicate with MQTT brokers.

A Scala-based IoT application

UPDATE: An expanded version of this application with individual actors representing each of the IoT devices, each of which maintains its own internal state and setting, is now available. Please see the Akka Actors IoT v.2 blog post for details.

In this blog post, I’m going to illustrate how to build a scalable distributed worker system using Akka actors to service requests from an MQTT-based IoT system. A good portion of the Akka clustering setup is derived from Lightbend’s Akka distributed workers template. Below is a diagram of the application:

IoT with MQTT and Akka Actor Systems

As shown in the diagram, the application consists of the following components:

1. IoT

  • A DeviceRequest actor which:
    • simulates work requests from IoT devices
    • publishes requests to an MQTT pub-sub topic
    • re-publishes requests upon receiving failure messages from a topic subscriber
  • An IotAgent actor which:
    • subscribes to the mqtt-topic for the work requests
    • sends received work requests via ClusterClient to the master cluster
    • sends the DeviceRequest actor a failure message upon receiving failure messages from the Master actor
  • An MQTT pub-sub client, MqttPubSub, for communicating with an MQTT broker
  • A configuration helper object, MqttConfig, consisting of:
    • the MQTT pub-sub topic
    • the URL for the MQTT broker
    • serialization methods to convert objects to byte arrays, and vice versa

2. Master Cluster

  • A fault-tolerant decentralized cluster which:
    • manages a singleton actor instance among the cluster nodes (with a specified role)
    • delegates a ClusterClientReceptionist on every node to answer external connection requests
    • provides fail-over of the singleton actor to the next-oldest node in the cluster
  • A Master singleton actor which:
    • registers Workers and distributes work to available Workers
    • acknowledges work request reception to IotAgent
    • publishes work results to a work-results topic via Akka distributed pub-sub
    • maintains work states using a persistence journal
  • A PostProcessor actor in the master cluster which:
    • simulates post-processing of the work results
    • subscribes to the work-results topic

3. Workers

  • An actor system of Workers, each of which:
    • communicates via ClusterClient with the master cluster
    • registers with, and pulls work from, the Master actor
    • reports work status to the Master actor
    • instantiates a WorkProcessor actor to perform the actual work
  • WorkProcessor actors which process the work requests

Source code is available at GitHub.
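Since the serialization helper is plain JVM code, below is a self-contained sketch of what MqttConfig’s conversion methods might look like; the case-class fields, topic string and method names are assumptions for illustration, not the repo’s actual code:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical message types standing in for the app's Work/Device classes
case class Device(id: String, deviceType: String)
case class Work(workId: String, device: Device, job: Int)

object MqttConfig {
  val topic = "genuine/iot/request"             // assumed topic name
  val broker = "tcp://test.mosquitto.org:1883"  // public test broker used in the post

  // object -> byte array, for publishing as an MQTT payload
  def writeToByteArray(obj: Any): Array[Byte] = {
    val baos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(baos)
    try { oos.writeObject(obj); baos.toByteArray } finally oos.close()
  }

  // byte array -> object, for the subscriber side
  def readFromByteArray[T](bytes: Array[Byte]): T = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try ois.readObject().asInstanceOf[T] finally ois.close()
  }
}
```

Java serialization works here because Scala case classes implement java.io.Serializable by default.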

A few notes:

  1. Neither the IotAgent nor the Worker actor system is a part of the master cluster, hence both need to communicate with the Master via ClusterClient.
  2. Rather than having the Master actor spawn child Workers and push work over, the Workers are set up to register with the Master and pull work from it – a model similar to what Derek Wyatt advocated in his post.
  3. Paho-Akka is used as the MQTT pub-sub client, with configuration information held within the helper object, MqttConfig.
  4. The helper object MqttConfig consists of MQTT pub-sub topic/broker information and methods to serialize/deserialize the Work objects, which, in turn, contain Device objects. The explicit serializations are necessary since multiple JVMs will be at play if one launches the master cluster, IoT and worker actor systems on separate JVMs.
  5. The test Mosquitto broker at tcp://test.mosquitto.org:1883 serves as the MQTT broker. An alternative is to install an MQTT broker (Mosquitto, HiveMQ, etc) local to the IoT network.
  6. The IotAgent uses the Actor’s ask method (?), instead of the fire-and-forget tell method (!), to confirm message receipt by the Master via a Future return. If receipt confirmation isn’t important, the tell method would be the preferred choice for performance.
  7. This is primarily a proof-of-concept application of IoT using Akka actors, hence code performance optimization isn’t a priority. In addition, for production systems a production-grade persistence journal (e.g. Redis, Cassandra) should be used, and multiple Masters via sharding could be considered.
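As an illustration of the ask-versus-tell trade-off in note 6, the two send styles differ roughly as in the sketch below; the message types and actor references are hypothetical, not the repo’s actual names:

```scala
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

implicit val timeout: Timeout = Timeout(10.seconds)

// fire-and-forget: fastest, but no confirmation the Master received the work
masterProxy ! work

// ask: returns a Future completed by the Master's acknowledgement,
// failing with an AskTimeoutException if no reply arrives in time
(masterProxy ? work).recover { case _ => NotOk(work) } // NotOk is a hypothetical failure message
```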

Test-running

Similar to how you would test-run Lightbend’s distributed workers template, you may open up separate command line terminals and run the different components on separate JVMs, adding and killing the launched components to observe how the systems scale out, fail over, persist work states, etc. Here’s an example of test-run sequence:

Below is some sample console output.

Console Output: Master seed node with persistence journal:

Console Output: IotAgent-DeviceRequest node:

Console Output: Worker node:


Akka Persistence Journal Using Redis

If you’ve used Lightbend’s Scala-Akka templates that involve persisting Akka actor states, you’ll notice that LevelDB is usually configured as the default storage medium for persistence journals (and snapshots). In many of these templates, a single LevelDB journal is shared by multiple actor systems. As reminded by the template documentation as well as code-level comments, such a setup isn’t suitable for production systems.

Thanks to the prospering Akka user community, there is a good list of journal plugins you could pick from to suit your specific needs. Journal choices include Cassandra, HBase, Redis, PostgreSQL and others. In this blog post, I’m going to highlight how to set up an Akka persistence journal using a plugin for Redis, one of the most popular open-source key-value stores.

Redis client for Scala

First things first, you’ll need a Redis server running on a server node you want your actor systems to connect to. If you don’t already have one, download the server from the Redis website and install it on a designated server host. The installation includes a command-line interface tool, redis-cli, that comes in handy for ad-hoc data updates/lookups.

Next, you need a Redis client for Scala, Rediscala, which is a non-blocking Redis driver that wraps Redis requests/responses in Futures. To include the Rediscala in the application, simply specify it as a library dependency in build.sbt.

Redis journal plugin

The Redis journal plugin is from Hootsuite. Similar to how Rediscala is set up in build.sbt, you can add the dependency for the Redis journal plugin. You’ll also need to add a resolver to tell sbt where to locate the Ivy repo for the journal plugin. The build.sbt content should look something like the following:
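A sketch of the dependency setup; the resolver URL, organization names and version numbers below are assumptions to be checked against the plugin’s README:

```scala
// build.sbt (fragment) -- coordinates/versions are illustrative
resolvers += "hootsuite-repo" at "<URL of Hootsuite's Maven/Ivy repo>"

libraryDependencies ++= Seq(
  "com.etaty.rediscala" %% "rediscala"              % "1.4.2", // Redis client for Scala
  "com.hootsuite"       %% "akka-persistence-redis" % "0.3.0"  // Redis journal plugin
)
```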

Alternatively, rather than specifying them as dependencies you can clone the git repos for the Redis client and journal plugin, use sbt to generate a jar file for each of them, and include them in your application library (e.g. under /activator-project-root/lib/).

Application configurations

Now that the library dependency setup for Redis journal and Redis client is taken care of, next in line is to update the configuration information in application.conf to replace LevelDB with Redis.

Besides Akka related configuration, the Redis host and port information is specified in the configuration file. The Redis journal plugin has the RedisJournal class that extends trait DefaultRedisComponent, which in turn reads the Redis host/port information from the configuration file and overrides the default host/port (localhost/6379) in the RedisClient case class within Rediscala.

As for the Akka persistence configuration, simply remove all LevelDB related lines in the configuration file and add the Redis persistence journal (and snapshot) plugin information. The application.conf content now looks like the following:
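A sketch of what the resulting configuration might look like; the exact plugin identifiers depend on the journal plugin’s documentation and are assumptions here:

```
# application.conf (fragment) -- plugin keys are illustrative
akka.persistence.journal.plugin = "akka-persistence-redis.journal"
akka.persistence.snapshot-store.plugin = "akka-persistence-redis.snapshot"

redis {
  host = "localhost"
  port = 6379
}
```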

Onto the application source code

That’s all the configuration changes needed for using Redis persistence journal. To retire LevelDB as the journal store from within the application, you can simply remove all code and imports that reference LevelDB for journal/snapshot setup. Any existing code logic you’ve developed to persist for LevelDB should now be applied to the Redis journal without changes.

In other words, this LevelDB-to-Redis journal migration is almost entirely a configuration effort. For general-purpose persistence of actor states, Akka’s persist method shields you from having to deal directly with Redis-specific interactions. Tracing the source code of Akka’s PersistentActor.scala, the persist method is defined as follows:

For instance, a typical persist snippet might look like the following:
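As an illustration, a minimal PersistentActor using persist might look like the sketch below; the actor, command and event names are made up for the example:

```scala
import akka.persistence.PersistentActor

case class Increment(by: Int)    // command
case class Incremented(by: Int)  // event written to the journal

class Counter extends PersistentActor {
  override def persistenceId = "counter-1"
  private var count = 0

  private def updateState(evt: Incremented): Unit = count += evt.by

  // replayed events on recovery go through the same state update
  override def receiveRecover: Receive = { case evt: Incremented => updateState(evt) }

  override def receiveCommand: Receive = {
    case Increment(by) =>
      // the handler runs only after the event is successfully persisted
      persist(Incremented(by))(updateState)
  }
}
```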

In essence, as long as actor states are persisted with the proper method signature, any journal store specific interactions will be taken care of by the corresponding journal plugin.


A Brief Encounter With Docker

Docker, an application-container distribution automation tool using Linux-based virtualization, has gained a lot of momentum since it was released in 2013. I never had a chance to try it out, but a current project has prompted me to move it up my ever-growing To-Do list. Below is a re-cap of my first two hours of experimenting with Docker.

First things first: get a quick grasp of Docker’s basics. I was going to test it on a MacBook and decided to go for the beta version of Docker for Mac. It’s essentially a native-app version of Docker Toolbox, with the small trade-off of being limited to a single VM, which can be overcome by using it alongside Docker Toolbox. The key differences between the two apps are nicely illustrated on Docker’s website.

Downloading and installing Docker for Mac was straightforward. Below is some configuration info about the installed software:

Next, it’s almost illegal not to run something by the name of hello-world when installing new software. While at it, test-run a couple of less trivial apps to get a feel for running Docker-based apps, including an Nginx server and a Ubuntu Bash shell.

While running hello-world or the Ubuntu shell is a one-time deal (e.g. the Ubuntu shell is gone once you exit), the -d (for detach) run-command option for Nginx leaves the server running in the background. Below is one quick way to identify all actively running Docker containers and stop them:
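One way to do the sweep, as a blunt sketch that stops every running container:

```
docker ps -q | xargs docker stop
```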

It’s also almost illegal to let any hello-world app sit around forever, so it’s a perfect candidate for testing image removal. You’ll have to remove all associated containers before removing the image. Here’s one option:
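A sketch of the removal sequence; the awk filter is one ad-hoc way to match containers by image name:

```
docker ps -a | awk '/hello-world/ {print $1}' | xargs docker rm
docker rmi hello-world
```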

Note that the above method only removes those containers whose description matches the image name. In case an associated container lacks the matching name, you’ll need to remove it manually (docker rm <container-id>).

Adapted from Linux’s Cowsay game, Docker provides a Whalesay game and illustrates how to combine it with another Linux game, Fortune, to create a custom image. This requires composing a Dockerfile with proper instructions to create the image, as shown below:
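The Dockerfile from that tutorial is roughly as follows (package names per Docker’s whalesay example):

```
FROM docker/whalesay:latest
RUN apt-get -y update && apt-get install -y fortunes
CMD /usr/games/fortune -a | cowsay
```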

Next, to manage your Docker images in the cloud, sign up for an account at Docker Hub. Similar to GitHub, Docker Hub allows you to maintain public image repos for free. To push Docker images to your Docker Hub account, you’ll need to name your images with namespace matching your user account’s. The easiest way would be to have the prefix of your image name match your account name.

For instance, to push the fortune-whalesay image to Docker Hub with account name leocc, rename it to leocc/fortune-whalesay:
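The tag-and-push sequence might look like this sketch, using the leocc account name from the post:

```
docker tag fortune-whalesay leocc/fortune-whalesay
docker login
docker push leocc/fortune-whalesay
```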

Finally, it’s time to try actually dockerizing an app of my own and pushing it to Docker Hub. A Java app of a simple NIO-based Reactor server is used here:

The dockerized Java app is now at Docker Hub. Now that it’s in the cloud, you may remove the local image and associated containers as described earlier. When you want to download and run it later, simply issue the docker run command.

My brief experience exploring Docker’s basics has been positive. If you’re familiar with Linux and GitHub, picking up the commands for various tasks in Docker comes naturally. As for the native Docker for Mac app, even though it’s still in beta, it executed every command reliably as advertised.


Self-contained Node.js Deployment

While setting up a Node.js environment on an individual developer’s machine can be done in a casual manner and oftentimes can be tailored to the developer’s own taste, deploying Node.js applications on shared or production servers requires a little more planning in advance.

To install Node.js on a server, a straightforward approach is to just follow some quick-start instructions from an official source. For instance, assuming the latest v.4.x of Node.js is the target version and CentOS Linux is the OS on the target server, the installation can be as simple as follows:
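For instance, using the NodeSource setup script (URL per NodeSource’s published instructions; run as root):

```
curl -sL https://rpm.nodesource.com/setup_4.x | bash -
yum install -y nodejs
```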

Software version: Latest versus Same

However, the above installation option leaves the version of the installed Node.js out of your own control. Although the major release would stick to v.4, the latest update to Node available at the time of the command execution will be installed.

There are debates about always-getting-the-latest versus keeping-the-same-version when it comes to software installation. My take is that on individual developer’s machine, you’re at liberty to go for ‘latest’ or ‘same’ to suit your own need (for exploring experimental features versus getting ready for production support). But on servers for staging, QA, or production, I would stick to ‘same’.

Some advocates of ‘latest’ even for production servers argue that not doing so could compromise security on the servers. It’s a valid concern, but stability is also a critical factor. My recommendation is to keep versions on critical servers consistent while making version updates for security a separate and independent duty, preferably handled by dedicated operations staff.

Onto keeping a fixed Node.js version

As of this writing, the latest LTS (long-term support) release of Node.js is v.4.4.7. The next LTS (v.6.x) is scheduled to be out in the next quarter of the year. Again, let’s assume we’re on CentOS, and that it’s CentOS 7 64-bit. There are a couple of options.

Option 1: Build from source
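A sketch of a source build, pinned to v.4.4.7:

```
curl -O https://nodejs.org/dist/v4.4.7/node-v4.4.7.tar.gz
tar xzf node-v4.4.7.tar.gz
cd node-v4.4.7
./configure
make
sudo make install
```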

As a side note, if you’re on CentOS 6 or older, you’ll need to update gcc and Python.

Option 2: Use pre-built binary
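A sketch using the pre-built 64-bit Linux binary, again pinned to v.4.4.7 and unpacked under /usr/local:

```
curl -O https://nodejs.org/dist/v4.4.7/node-v4.4.7-linux-x64.tar.xz
sudo tar -xJf node-v4.4.7-linux-x64.tar.xz -C /usr/local --strip-components=1
node -v
```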

Note that both the above two options install a system-wide Node.js (which comes with the default package manager NPM) accessible to all legitimate users on the server host.

Node process manager

Next, install a process manager to manage the Node app’s processes, providing features such as auto-restart. Two of the most prominent ones are forever and pm2. Let’s go with the slightly more robust one, pm2. Check for the latest version on the pm2 website and specify it in the npm install command:
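For example (the pm2 version below is illustrative; pin whatever the current one is):

```
sudo npm install -g pm2@1.1.3
pm2 -v
```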

Deploying self-contained Node.js

Depending on specific deployment requirements, one might prefer having Node confined to a local file structure that belongs to a designated user on the server host. Contrary to having a system-wide Node.js, this approach would equip each of your Node projects with its own Node.js binary and modules.

Docker, as briefly touched on in a previous blog, would be a good tool for such a use case, but one can also handle it without introducing an OS-level virtualization layer. Here’s how Node.js can be installed underneath a local Node.js project directory:
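A sketch of a project-local install (paths are illustrative):

```
PROJDIR=$HOME/myapp
mkdir -p $PROJDIR/njs
curl -O https://nodejs.org/dist/v4.4.7/node-v4.4.7-linux-x64.tar.xz
tar -xJf node-v4.4.7-linux-x64.tar.xz -C $PROJDIR/njs --strip-components=1
$PROJDIR/njs/bin/node -v
```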

Next, create simple scripts to start/stop the local Node.js app (assuming main Node app is app.js):

Script: $PROJDIR/bin/njsenv.sh (sourced by start/stop scripts)

Script: $PROJDIR/bin/start.sh

Script: $PROJDIR/bin/stop.sh
It would make sense to organize such scripts in, say, a top-level bin/ subdirectory. Along with the typical file structure of your Node app, such as controllers, routes and configurations, your Node.js project directory might now look like the following:
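An illustrative layout (directory names are assumptions):

```
$PROJDIR/
├── app.js
├── bin/
│   ├── njsenv.sh
│   ├── start.sh
│   └── stop.sh
├── config/
├── controllers/
├── njs/            <- project-local Node.js binary and npm
├── node_modules/
└── routes/
```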

Packaging/Bundling your Node.js app

Now that the key Node.js software modules are in place all within a local $PROJDIR subdirectory, next in line is to shift the focus to your own Node app and create some simple scripts for bundling the app.

This blog post is aimed to cover relatively simple deployment cases in which there isn’t need for environment-specific code build. Should such need arise, chances are that you might already be using a build automation tool such as gulp, which was heavily used by a Node app in a recent startup I cofounded. In addition, if the deployment requirements are complex enough, configuration management/automation tools like Puppet, SaltStack or Chef might also be used.

For simple Node.js deployments in which the app modules can be pre-built prior to deployment, one can simply come up with scripts to pre-package the app in a tar ball, which then gets expanded in the target server environments.

To better manage files for the packaging/bundling task, it’s a good practice to maintain a list of files/directories to be included in a text file, say, include.files. For instance, if there is no need for environment-specific code builds, package.json doesn’t need to be included when packaging for the QA/production environment. While at it, also keep a file, exclude.files, that lists all the files/directories to be excluded. For example:
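The two list files might look like the sketch below (entries are illustrative):

```
# include.files
app.js
bin
config
controllers
njs
node_modules
routes

# exclude.files
.git
*.log
tmp
```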

Below is a simple shell script which does the packaging/bundling of a localized Node.js project:
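A minimal, self-contained sketch of the idea: build a throwaway project tree, then bundle it with GNU tar using the include/exclude lists (all names are illustrative):

```shell
#!/bin/sh
set -e
PROJDIR=$(mktemp -d)           # stand-in for a real project directory
cd "$PROJDIR"
mkdir -p bin routes
echo "console.log('hi')" > app.js
touch bin/start.sh bin/debug.log routes/index.js
printf 'app.js\nbin\nroutes\n' > include.files
printf '*.log\n' > exclude.files
# bundle only the listed files, skipping the excluded patterns
tar -czf bundle.tgz --exclude-from=exclude.files --files-from=include.files
tar -tzf bundle.tgz
```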

Run bundling scripts from within package.json

An alternative to doing the packaging/bundling with external scripts is to make use of npm's own features. The popular Node package manager comes with file exclusion rules based on files listed in .npmignore and .gitignore. It also comes with scripting capability to handle much of what's just described. For example, one could define a custom file-inclusion variable within package.json along with executable scripts that do the packaging/bundling using variables in the form of $npm_package_{var}, like the following:
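A sketch of such a package.json (the bundleFiles field and script name are assumptions; note also that the exact set of npm_package_* variables exposed to scripts varies across npm versions):

```json
{
  "name": "myapp",
  "version": "1.0.0",
  "bundleFiles": "server.js lib node_modules",
  "scripts": {
    "bundle": "tar -czf $npm_package_name-$npm_package_version.tgz $npm_package_bundleFiles"
  }
}
```

Running `npm run bundle` would then produce myapp-1.0.0.tgz.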

Here comes another side note: in the dependencies section, a version prefixed with ~ allows patch-level updates (e.g. ~1.2.3 matches any 1.2.x with x >= 3), whereas the ^ prefix allows minor-level updates (e.g. ^1.2.3 matches any 1.x.y at or above 1.2.3, but below 2.0.0).

To deploy the Node app on a server host, simply scp the bundled tarball to the designated user on the host (e.g. scp $NAME-$VERSION.tgz njsapp@:package/) and use a simple script similar to the following to extract the bundled tarball on the host and start/stop the Node app:
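A minimal sketch of such a deploy script; the app layout (server.js, app.pid, app.log) is an illustrative assumption:

```shell
#!/bin/sh
# deploy.sh -- extract an uploaded app bundle and (re)start the app.
deploy() {
  pkg=$1           # e.g. myapp-1.0.0.tgz
  appdir=$2        # target directory to expand the bundle into
  mkdir -p "$appdir"
  tar -xzf "$pkg" -C "$appdir"
  # stop a previously running instance, if any
  if [ -f "$appdir/app.pid" ]; then
    kill "$(cat "$appdir/app.pid")" 2>/dev/null || true
  fi
  # start the app in the background and record its pid
  (
    cd "$appdir" || exit 1
    nohup node server.js > app.log 2>&1 &
    echo $! > app.pid
  )
}
```

Since the bundle carries its own node_modules (and, if desired, the Node binary itself), the host needs nothing preinstalled beyond this script.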

Deployment requirements can be very different for individual engineering operations, so all that has been suggested here should be taken as simplified use cases. The main objective is to come up with a self-contained Node.js application so that developers can autonomously package their code with a version-consistent Node binary and dependencies. A big advantage of such an approach is separation of concerns: the OPS team does not need to worry about Node installation and versioning.


Implicit Conversion In Scala

These days, software engineers with knowledge of robust frameworks/libraries are abundant, but those who fully command the core basics of a language platform remain scarce. When coding solutions need to perform, scale, or resolve tricky bugs, a solid understanding of the programming language's core features is often what makes the real difference.

Scala’s signature strengths

Having been immersed in a couple of R&D projects using Scala (along with Akka actors) over the past 6 months, I've come to appreciate quite a few things it offers. Aside from the obvious signature strength of being a good hybrid of functional and object-oriented programming, other strengths include implicit conversion, type parameterization and futures/promises. In addition, Akka actors coupled with Scala make a highly scalable concurrency solution applicable to many distributed systems, including IoT systems.

In this blog post, I'm going to talk about Scala's implicit conversion, which I think is part of the language's core basics. For illustration purposes, simple arithmetic on complex numbers will be implemented using this very feature.

A basic complex-number class would probably look something like the following:
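The original code block was lost from this copy of the post, so here is a minimal sketch of what it likely contained (the member names real and imaginary are assumptions):

```scala
class Complex(val real: Double, val imaginary: Double)
```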

Since a complex number can have a zero imaginary component, leaving only the real component, it's handy to have an auxiliary constructor for those real-only cases as follows:
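A sketch of the auxiliary constructor, which delegates to the primary constructor with a zero imaginary part:

```scala
class Complex(val real: Double, val imaginary: Double) {
  // auxiliary constructor for real-only values, e.g. new Complex(3.0)
  def this(real: Double) = this(real, 0.0)
}
```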

Just a side note: an auxiliary constructor must invoke another constructor of the class as its first action and cannot invoke a superclass constructor.

Next, let's override the toString method to cover the various ways an x + yi complex number could look:
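One plausible sketch, covering the real-only, imaginary-only and negative-imaginary cases:

```scala
class Complex(val real: Double, val imaginary: Double) {
  def this(real: Double) = this(real, 0.0)

  override def toString: String =
    if (imaginary == 0) s"$real"
    else if (real == 0) s"${imaginary}i"
    else if (imaginary < 0) s"$real - ${-imaginary}i"
    else s"$real + ${imaginary}i"
}
```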

Let’s also fill out the section for the basic arithmetic operations:
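A sketch of the four operators, using the standard complex-arithmetic formulas (these methods go inside the Complex class):

```scala
class Complex(val real: Double, val imaginary: Double) {
  def this(real: Double) = this(real, 0.0)

  def +(that: Complex) = new Complex(real + that.real, imaginary + that.imaginary)
  def -(that: Complex) = new Complex(real - that.real, imaginary - that.imaginary)
  def *(that: Complex) = new Complex(
    real * that.real - imaginary * that.imaginary,
    real * that.imaginary + imaginary * that.real)
  def /(that: Complex) = {
    val d = that.real * that.real + that.imaginary * that.imaginary
    new Complex(
      (real * that.real + imaginary * that.imaginary) / d,
      (imaginary * that.real - real * that.imaginary) / d)
  }
}
```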

Testing it out …
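Assuming the Complex class sketched above (with the toString and arithmetic methods), a quick test might look like:

```scala
val a = new Complex(1.0, 2.0)
val b = new Complex(3.0, -4.0)
println(a + b)   // 4.0 - 2.0i
println(a * b)   // 11.0 + 2.0i
```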

So far so good. But what about this?
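The problematic expressions were presumably along these lines:

```scala
val a = new Complex(1.0, 2.0)
val b = new Complex(3.0, -4.0)
a + 1.0   // does not compile
2.0 + b   // does not compile
```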

The compiler complains because it does not know how to handle arithmetic operations between a Complex and a Double. With the auxiliary constructor, ‘a + new Complex(1.0)’ compiles fine, but it’s cumbersome to have to represent every real-only complex number that way. We could resolve the problem by adding overloads like the following for the ‘+’ method:
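For instance, overloads of this shape could be added inside the Complex class (a sketch, since the original snippet is missing):

```scala
  def +(that: Double) = new Complex(real + that, imaginary)
  def -(that: Double) = new Complex(real - that, imaginary)
```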

But then what about this?
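With such overloads in place, only one side of the problem is solved:

```scala
val b = new Complex(3.0, -4.0)
val x = b + 2.0   // fine: resolves to Complex's '+'(Double)
val y = 2.0 + b   // still does not compile
```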

The compiler interprets ‘a + 1.0’ as a.+(1.0). Since a is a Complex, the proposed new ‘+’ method in the Complex class can handle it. But ‘2.0 + b’ will fail because there isn’t a ‘+’ method in Double that can handle a Complex. This is where implicit conversion shines.

The implicit method realToComplex hints to the compiler to fall back to using the method when it encounters a compilation problem associated with type Double. In many cases, implicit methods are never explicitly called, so their names can be pretty much arbitrary. For instance, renaming realToComplex to foobar in this case would get the same job done.
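The conversion itself is a one-liner (sketched here, relying on the auxiliary constructor):

```scala
implicit def realToComplex(r: Double): Complex = new Complex(r)
```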

As a bonus, arithmetic operations between Complex and Int (or Long, Float) work too. That’s because Scala already has, for instance, integer-to-double conversion covered internally using implicit conversion in object Int (and, in version 2.9.x or older, object Predef):
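Paraphrased (not verbatim) from the standard library, the built-in widening conversion looks essentially like:

```scala
implicit def int2double(x: Int): Double = x.toDouble
```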

Testing again …
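With realToComplex in scope, the previously failing cases now compile (outputs assume the toString sketched earlier):

```scala
val a = new Complex(1.0, 2.0)
val b = new Complex(3.0, -4.0)
println(2.0 + b)   // 5.0 - 4.0i
println(a - 1.0)   // 2.0i
```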

Implicit conversion scope

To ensure the implicit conversion rule is effective when the Complex class is used, we need to keep it in scope. Defining the implicit method in, or importing it into, the current scope certainly serves us well. An alternative is to define it in a companion object as follows:
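A sketch of the companion-object placement; residing in the companion object puts the conversion in the implicit scope of type Complex, so no explicit import is needed:

```scala
class Complex(val real: Double, val imaginary: Double) {
  def this(real: Double) = this(real, 0.0)
  def +(that: Complex) = new Complex(real + that.real, imaginary + that.imaginary)
  // ... other methods as before
}

object Complex {
  implicit def realToComplex(r: Double): Complex = new Complex(r)
}
```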

As a final note, in case a factory method is preferred, thus removing the need for the ‘new’ keyword in instantiation, we could slightly modify the companion object/class as follows:
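A sketch with apply factory methods added to the companion object:

```scala
class Complex(val real: Double, val imaginary: Double) {
  def +(that: Complex) = new Complex(real + that.real, imaginary + that.imaginary)
  // ... other methods as before
}

object Complex {
  def apply(real: Double, imaginary: Double): Complex = new Complex(real, imaginary)
  def apply(real: Double): Complex = new Complex(real, 0.0)
  implicit def realToComplex(r: Double): Complex = Complex(r)
}
```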

Another quick test …
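With the factory methods in place:

```scala
val a = Complex(1.0, 2.0)   // no 'new' required
println(2.0 + a)            // 3.0 + 2.0i, assuming the earlier toString
```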

2 thoughts on “Implicit Conversion In Scala”

  1. Pingback: Generic Merge Sort In Scala | Genuine Blog

  2. Pingback: Composing Partial Functions In Scala | Genuine Blog


Generic Merge Sort In Scala

Many software engineers may not need to explicitly deal with type parameterization or generic types in their day-to-day jobs, but it’s very likely that the libraries and frameworks they use heavily have already done their duty to ensure static type-safety via such parametric polymorphism features.

In a statically-typed functional programming language like Scala, such a feature often needs to be used first-hand in order to create useful functions that ensure type-safety while keeping the code lean and versatile. Generics are evidently taken seriously in Scala’s language design. That, coupled with Scala’s implicit conversion, constitutes a signature feature of Scala. Given Scala’s love of “smileys”, a few of them are designated for the relevant functionalities.

Merge Sort

Merge Sort is a popular textbook sorting algorithm that I think also serves as a great brain-teasing programming exercise. I have an old blog post about implementing Merge Sort using Java Generics. In this post, I’m going to use Merge Sort again to illustrate Scala’s type parameterization.

By means of a merge function which recursively merge-sorts the left and right halves of a partitioned list, a basic Merge Sort function for integer sorting might look similar to the following:
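The original code block is missing from this copy, so here is a sketch of the integer version consistent with the description:

```scala
def mergeSort(ls: List[Int]): List[Int] = {
  def merge(l: List[Int], r: List[Int]): List[Int] = (l, r) match {
    case (Nil, _) => r
    case (_, Nil) => l
    case (lHead :: lTail, rHead :: rTail) =>
      if (lHead < rHead) lHead :: merge(lTail, r)
      else rHead :: merge(l, rTail)
  }
  val n = ls.length / 2
  if (n == 0) ls
  else {
    val (left, right) = ls.splitAt(n)
    merge(mergeSort(left), mergeSort(right))
  }
}
```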

A quick test …
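Using the integer version above:

```scala
println(mergeSort(List(3, 1, 4, 1, 5, 9, 2, 6)))
// List(1, 1, 2, 3, 4, 5, 6, 9)
```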

Contrary to Java Generics’ MyClass<T> notation, Scala’s generic types are in the form of MyClass[T]. Let’s generalize the integer Merge Sort as follows:
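A sketch of the first, naive generalization (which, as explained next, does not compile):

```scala
def mergeSort[T](ls: List[T]): List[T] = {
  def merge(l: List[T], r: List[T]): List[T] = (l, r) match {
    case (Nil, _) => r
    case (_, Nil) => l
    case (lHead :: lTail, rHead :: rTail) =>
      if (lHead < rHead) lHead :: merge(lTail, r)  // error: value < is not a member of type parameter T
      else rHead :: merge(l, rTail)
  }
  val n = ls.length / 2
  if (n == 0) ls
  else {
    val (left, right) = ls.splitAt(n)
    merge(mergeSort(left), mergeSort(right))
  }
}
```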

The compiler immediately complains about the ‘<‘ comparison, since T might not be a type for which ordering with ‘<‘ makes any sense. To generalize the Merge Sort function to any list type that supports ordering, we can supply an Ordering[T] parameter in curried form as follows:
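A sketch of the curried-parameter version; note that the recursive calls must pass the ordering along explicitly:

```scala
import scala.math.Ordering

def mergeSort[T](ls: List[T])(order: Ordering[T]): List[T] = {
  def merge(l: List[T], r: List[T]): List[T] = (l, r) match {
    case (Nil, _) => r
    case (_, Nil) => l
    case (lHead :: lTail, rHead :: rTail) =>
      if (order.lt(lHead, rHead)) lHead :: merge(lTail, r)
      else rHead :: merge(l, rTail)
  }
  val n = ls.length / 2
  if (n == 0) ls
  else {
    val (left, right) = ls.splitAt(n)
    merge(mergeSort(left)(order), mergeSort(right)(order))
  }
}
```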

Another quick test …
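Supplying the ordering explicitly:

```scala
println(mergeSort(List("banana", "apple", "cherry"))(Ordering.String))
// List(apple, banana, cherry)
```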

That works well, but it’s cumbersome that one needs to supply the corresponding Ordering[T] for the list type. That’s where an implicit parameter can help:
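Marking the curried parameter implicit lets the compiler resolve it; the recursive calls no longer need to pass it along:

```scala
import scala.math.Ordering

def mergeSort[T](ls: List[T])(implicit order: Ordering[T]): List[T] = {
  def merge(l: List[T], r: List[T]): List[T] = (l, r) match {
    case (Nil, _) => r
    case (_, Nil) => l
    case (lHead :: lTail, rHead :: rTail) =>
      if (order.lt(lHead, rHead)) lHead :: merge(lTail, r)
      else rHead :: merge(l, rTail)
  }
  val n = ls.length / 2
  if (n == 0) ls
  else {
    val (left, right) = ls.splitAt(n)
    merge(mergeSort(left), mergeSort(right))
  }
}
```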

Testing again …
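Now the standard Ordering instances are picked up automatically:

```scala
println(mergeSort(List(3, 1, 4, 1, 5)))               // List(1, 1, 3, 4, 5)
println(mergeSort(List("banana", "apple", "cherry"))) // List(apple, banana, cherry)
```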

Note that the ‘if (lHead < rHead)’ condition is now replaced with ‘if (order.lt(lHead, rHead))’. That’s because math.Ordering defines its own less-than method for generic types. Let’s dig a little deeper into how it works. Scala’s math.Ordering extends Java’s Comparator interface and implements the method compare(x: T, y: T) for all the common types: Int, Long, Float, Double, String, etc. It then provides all these lt(x: T, y: T), gt(x: T, y: T), …, methods that know how to perform the less-than and greater-than comparisons for the various types.

The following are highlights of math.Ordering’s partial source code:
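A simplified paraphrase (not the verbatim source) capturing the relevant pieces:

```scala
trait Ordering[T] extends java.util.Comparator[T] {
  def compare(x: T, y: T): Int
  def lt(x: T, y: T): Boolean = compare(x, y) < 0
  def gt(x: T, y: T): Boolean = compare(x, y) > 0
  def lteq(x: T, y: T): Boolean = compare(x, y) <= 0
  def gteq(x: T, y: T): Boolean = compare(x, y) >= 0
}

object Ordering {
  trait IntOrdering extends Ordering[Int] {
    def compare(x: Int, y: Int): Int =
      if (x < y) -1 else if (x == y) 0 else 1
  }
  implicit object Int extends IntOrdering
  // ... similar implicit instances for Long, Float, Double, String, etc.
}
```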

Context Bound

Scala provides a typeclass pattern called Context Bound which captures this common pattern of passing in an implicit value:
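The general shape of the pattern (names here are placeholders for illustration):

```scala
def someFunction[T](/* args */)(implicit ev: SomeTypeClass[T]): SomeReturnType
```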

With the context bound syntactic sugar, it becomes:
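The sugared equivalent of the same placeholder signature:

```scala
def someFunction[T : SomeTypeClass](/* args */): SomeReturnType
```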

The mergeSort function using context bound looks as follows:
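A sketch of the context-bound version:

```scala
import scala.math.Ordering

def mergeSort[T : Ordering](ls: List[T]): List[T] = {
  val order = implicitly[Ordering[T]]
  def merge(l: List[T], r: List[T]): List[T] = (l, r) match {
    case (Nil, _) => r
    case (_, Nil) => l
    case (lHead :: lTail, rHead :: rTail) =>
      if (order.lt(lHead, rHead)) lHead :: merge(lTail, r)
      else rHead :: merge(l, rTail)
  }
  val n = ls.length / 2
  if (n == 0) ls
  else {
    val (left, right) = ls.splitAt(n)
    merge(mergeSort(left), mergeSort(right))
  }
}
```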

Note that ‘implicitly[Ordering[T]]’ is there for access to the methods in math.Ordering, which is no longer passed in under a parameter name.

Scala’s math.Ordered versus math.Ordering

One noteworthy thing about math.Ordering is that it does not overload the comparison operators ‘<‘, ‘>‘, etc., which is why the method lt(x: T, y: T) is used instead in mergeSort for the ‘<‘ operator. To use comparison operators like ‘<‘, one would need to import order.mkOrderingOps (or order._) within the mergeSort function. That’s because in math.Ordering, the comparison operators ‘<‘, ‘>‘, etc., are all defined in the inner class Ops, which can be instantiated by calling the method mkOrderingOps.

Scala’s math.Ordered extends Java’s Comparable interface (instead of Comparator) and implements the method compareTo(y: T), derived from math.Ordering’s compare(x: T, y: T) via an implicit parameter. One nice thing about math.Ordered is that it comes with overloaded comparison operators.

The following highlights partial source code of math.Ordered:
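A simplified paraphrase (not the verbatim source):

```scala
trait Ordered[A] extends java.lang.Comparable[A] {
  def compare(that: A): Int
  def <(that: A): Boolean = compare(that) < 0
  def >(that: A): Boolean = compare(that) > 0
  def <=(that: A): Boolean = compare(that) <= 0
  def >=(that: A): Boolean = compare(that) >= 0
  def compareTo(that: A): Int = compare(that)
}
```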

Using math.Ordered, an implicit method, implicit orderer: T => Ordered[T] (as opposed to an implicit value when using math.Ordering), is passed to the mergeSort function as a curried parameter. As illustrated in a previous blog post, it’s an implicit conversion rule for the compiler to fall back on when encountering a problem associated with type T.

Below is a version of generic Merge Sort using math.Ordered:
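A sketch of this version; the implicit view orderer lets ‘<‘ be used directly on elements of type T:

```scala
import scala.math.Ordered

def mergeSort[T](ls: List[T])(implicit orderer: T => Ordered[T]): List[T] = {
  def merge(l: List[T], r: List[T]): List[T] = (l, r) match {
    case (Nil, _) => r
    case (_, Nil) => l
    case (lHead :: lTail, rHead :: rTail) =>
      if (lHead < rHead) lHead :: merge(lTail, r)
      else rHead :: merge(l, rTail)
  }
  val n = ls.length / 2
  if (n == 0) ls
  else {
    val (left, right) = ls.splitAt(n)
    merge(mergeSort(left), mergeSort(right))
  }
}
```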

View Bound

A couple of notes:

  1. The implicit method ‘implicit orderer: T => Ordered[T]’ is passed into the mergeSort function as an implicit parameter.
  2. Function mergeSort has a signature of the following common form:
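The common form referred to above (placeholder names for illustration):

```scala
def someFunction[T](/* args */)(implicit orderer: T => Ordered[T]): SomeReturnType
```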

Such a pattern, an implicit method passed in as an implicit parameter, is so common that it’s given the term View Bound and awarded a designated smiley ‘<%’. Using a view bound, it can be expressed as:
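The sugared equivalent of the same placeholder signature:

```scala
def someFunction[T <% Ordered[T]](/* args */): SomeReturnType
```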

Applied to the mergeSort function, it gives a slightly leaner and meaner look:
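A sketch of the view-bound version (note that view bounds are deprecated in later Scala versions in favor of an explicit implicit parameter):

```scala
import scala.math.Ordered

def mergeSort[T <% Ordered[T]](ls: List[T]): List[T] = {
  def merge(l: List[T], r: List[T]): List[T] = (l, r) match {
    case (Nil, _) => r
    case (_, Nil) => l
    case (lHead :: lTail, rHead :: rTail) =>
      if (lHead < rHead) lHead :: merge(lTail, r)
      else rHead :: merge(l, rTail)
  }
  val n = ls.length / 2
  if (n == 0) ls
  else {
    val (left, right) = ls.splitAt(n)
    merge(mergeSort(left), mergeSort(right))
  }
}
```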

As a side note, while the view bound looks like the other smiley ‘<:’ (Upper Bound), they represent very different things. An upper bound is commonly seen in the following form:
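The typical shape (S here stands for some existing type):

```scala
def someFunction[T <: S](arg: T): Unit = { /* ... */ }
```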

This means someFunction takes only an input parameter of type T that is a sub-type of (or the same as) type S. While at it, a Lower Bound, represented by the ‘>:’ smiley in the form of [T >: S], means the input parameter can only be a super-type of (or the same as) type S.


Startup Culture 2.0

Startup has been a household term since the early/mid 90’s, when the World Wide Web (now a nostalgic term) began to take the world by storm. Triggered by the popular graphical browser Mosaic, the blossoming of the Web all of a sudden opened up all sorts of business opportunities, attracting entrepreneurs trying to capitalize on the newly born, eye-catching medium.

A historically well-known place for technology entrepreneurship, Silicon Valley (or, more precisely, the San Francisco Bay Area) became an even hotter spot for entrepreneurs to swarm into. Many of these entrepreneurs were young, energetic college graduates (or drop-outs) in science/engineering disciplines who were well equipped to quickly learn and apply new things in the computing area. They generally had a fast-paced work style with a can-do spirit. Along with the youthful work-hard play-hard attitude, the so-called startup culture was born. Sun Microsystems was probably a great representative of companies embracing that very culture back in the dot-com era.

So, that was a brief, admittedly unofficial, history of the rise of startup culture 1.0.

Setting up an open-space engineering room

Setting up an engineering workspace

Version 2.0

This isn’t another ex-dot-commer glorifying the good old startup culture of the dot-com days that later degenerated into the less commendable version observed today. Rather, it’s just my observation over recent years of the gradual, subtle changes to the so-called startup culture.

Heading into startup culture 2.0 (an obviously arbitrary version number), besides the emphasis on fast-paced and can-do, along came a number of phenomena including long hours, open space and the Agile “movement”.

Long hours

In 1.0, we saw a lot of startup technologists voluntarily working long hours in the office. Glorified stories of techies literally camping in the office live on. Company management nowadays still believe (or want to believe) that software engineers working long hours is a signature of startup culture.

In reality, working long hours is no longer a “voluntary” phenomenon, if it ever was. Why is that? I believe one of the reasons is that a good portion of the techies in startups today are veterans who prefer a work-life balance, with a work schedule that focuses on productive hours rather than the number of hours. Another reason is that working remotely is now a more feasible option due to improved Internet connectivity at home, thus diminishing the need to stay in the office for network access.

The fact is that serious software engineering requires serious brain-work. One can only deliver a certain number of hours of quality work on any given day in a productive manner. Forcing engineers to pull long hours in the office beyond the normal productive limit might just result in higher code quantity but lower code quality. Worse yet, genuine software engineering enthusiasts tend to contribute bonus work at ad-hoc high-creativity moments outside of office hours, but forced long hours will likely kill any stamina or incentive left to do so.

Flexibility in work hours and locale

Back in 1.0, typical Internet connection speeds were slow. It was common for a good-sized company with nationwide offices to share a T1 line as its Internet backbone, whereas today many residential consumers use connections an order of magnitude faster than a T1. So, in order to carry out productive work back then, people had to go to the office; thus you could often find software engineers literally camping there.

Given the vastly improved residential Internet infrastructure today, much of the engineering work that used to be doable only in the office can now be done at home. So, if a software engineer already has a regular office presence, there is little to no reason to work long hours in the office. In fact, other than pre-scheduled group meetings and white-boarding sessions, engineers really don’t have to stay in the office all the time, especially those who have to endure a bad commute.

Open office space

The open-plan office has been advocated since the middle of the 1.0 era. Common office cubes used to be 5-1/2 feet or taller in the old days. People later began to explore opening up the visually cluttered office space by cutting about a foot off the cube walls, allowing individuals to talk face to face when standing up while keeping some privacy when sitting down. Personally, I think that’s the optimal setting. But in recent years, lots of startups adopted cubes with walls further lowered or completely removed, essentially enforcing a surround-sound “multicast” communication protocol on individuals at all times. In addition, the vast open view also ensures constant visual distractions.

Bear in mind that software engineers need good chunks of solitary time to conduct their coding work. Such a multicast-style, visually distracting environment isn’t going to be a productive one for them. It’s understandable for a cash-strapped early-stage startup to adopt a temporary, economical seating solution like that, but I’ve seen many well-funded companies build out such workspaces as their long-term work environments.

Agile

Virtually every software engineering organization is practicing some sort of Agile process. I like certain core Agile practices, like the 2-week development sprint, the daily 15-minute scrum, and continuous integration, which I think are applicable to many software development projects. But I’m against mechanically adopting everything advocated in the Agile methodology.

Which aspects of the Agile process to adopt for an engineering organization should be evaluated, carried out and adjusted in accordance with the engineers’ skill sets, project types and operating environment. The primary goal should be efficiency and productivity, not Agile for the sake of sounding cool.

Insecurity and distrust?

With typically limited budgets and tight timelines in mind, management tend to be nervous about whether their engineering hires deliver work as best they can. Long hours in the office create the impression that things are moving along at above-average speed. Open space further eases the anxiety through seeing-is-believing. And frequent sprints and daily scrums ensure maximum work output and microscopic measurement of individual performance.

If those are the motives perceived by the software engineers, most likely they won’t be happy, and the most competent ones will be the first to flee. Nor will management be happy when they don’t see the expected high productivity and find it hard to retain top engineers. The end result is that despite all the rallying around a fun, energetic startup culture on the company’s hiring web page, people hardly feel any fun or energy there.

What can be done better?

Management:

  1. Give your staff the benefit of the doubt — It’s hard to let go of the doubt about whether people are working as hard as expected, but pushing for long hours in the office and keeping people visually exposed at their desks only sends a signal of distrust and insecurity. By pushing for long hours in the office, management are in essence commodifying software engineering into some hourly-paid kind of mechanical work. It’ll only backfire and may just result in superficial punch-clock office attendance with low productivity. I would also recommend making work hours as flexible as possible. On working remotely, a pre-agreed telecommuting arrangement would go a long way for those who must endure a long commute. People with enough self-respect appreciate demonstrated trust from management, and it makes them enjoy their jobs more and thus produce better work.
  2. Work healthy — Work hard, play hard is almost a synonym for startup culture, in which we believe fun is a critical aspect of the environment. But throwing in a ping pong or foosball table does not automatically spawn a fun environment. In building out a work environment, I would argue that “work healthy” should perhaps replace “fun” as the primary initiative. Providing a healthy working environment leads to happier staff and better productivity, and fun will come as a bonus. Common ways to achieve that include a suitable office plan, ergonomic office furniture, natural lighting, exercise facilities, workout subsidy programs, healthy snacks, or even a room for taking naps. Speaking of naps, I think it’s worth serious consideration to embrace them as part of the culture. Evidently, a 15-30 minute nap after lunch can do magic in refreshing one’s mood and productivity for the remaining half of the day.
  3. Adopt Agile with agility — Take only what best suits your staff’s skill sets, project types and operating environment. Stick to the primary goal of better productivity and efficiency, and add/remove Agile practices as needed. It’s also important to regularly communicate with the staff for feedback and improvement, and make adaptive changes to the practices as necessary.
  4. Product development feedback — Despite all the development methodologies with fancy names, there is often a disconnect between actual engineering progress and the product development plan. A common tactic is to assemble an unfulfillably aggressive development plan to try to push for maximum engineering output. Unfortunately, such a practice often results in disrespect for development timelines or inferior product quality with ballooning overdue tech debt. A better approach is to maintain an effective feedback loop that constantly passes data (including actual progress estimates, tech-debt clearance needs, etc.) between product and engineering staff. The feedback allows product development staff to proactively plan out future feature sets that engineers can more realistically commit to delivering.
  5. Lead by example — Too often do we see management handing down a set of rules to the staff while condescendingly assuming the rules don’t apply to themselves. It’s almost certain such rules will at best be followed superficially. Another commonly seen phenomenon is management rallying to create a culture which conflicts in many ways with their actual beliefs and style. They do it just because they were told they must create a culture to run a startup operation, but they ignore the fact that culture can only be built and fostered with themselves genuinely being a part of it. It cannot be fabricated.

Individuals:

  1. Honor the honor system — There may be various reasons contributing to the commonly seen distrust by management. Unsurprisingly, one of them comes directly from individuals who abuse employee benefits meant to be used at one’s discretion. Perhaps the most common case is claiming the need to work from home with made-up reasons, or without actually putting in the hours. Well, you can’t blame people’s distrust in you unless you first honor the honor system. For instance, when you do work from home, stick to the actual meaning of work-from-home. Also, making your availability known to those who need to work closely with you is helpful. One effective way, especially for a relatively small team, is a shared group calendar designated for showing up-to-date team-member availability.
  2. Self-discipline — Again, using work-from-home as an example, one needs the self-discipline to actually put in a decent number of hours of work. It’s easy to be distracted, for instance by family members, when working at home, but it’s your own responsibility to make the arrangements necessary to minimize expected distractions. It’s also your obligation to make it clear to your teammates in advance when you will be unavailable for scheduled appointments and the like.
  3. Reasonable work priority — For unplanned urgent family matters, no one will complain if you drop everything to take care of them. However, that doesn’t justify frequently compromising your attendance at work for all sorts of personal events, unless that’s a pre-agreed work schedule. Bottom line: if your career matters to you, you shouldn’t place your job responsibilities far down your priority list below routine personal/family matters.
  4. Active participation — Most software engineers hate meetings, feeling that they consume too much of the time that could otherwise be used for actually building products. I think if the host and the participants are well prepared for a meeting, it’ll successfully serve its purpose (e.g. information sharing, brainstorming, team building, etc.) with minimal negative feelings. An unprepared participant attending a meeting with a feed-me mindset will likely find the meeting time-wasting. With some preparation, chances are that you will be much more engaged in the discussion and able to provide informed input. Such active participation can stimulate collective creativity and foster a culture of “best ideas win”.
  5. Keep upgrading yourself — This may sound off-topic, but keeping yourself abreast of the knowledge and best practices in the very area of your core job responsibilities does help shape the team culture. Constant self-improvement naturally boosts one’s confidence in one’s own domain which, in turn, facilitates efficient knowledge exchange and stimulates healthy competition. All that helps promote a high-efficiency, no-nonsense culture. The competitive aspect presents a healthy challenge to individuals, as long as excessive egos don’t get in the way. As a side note on “upgrading”: between breadth and depth, I would always lean toward depth. These days it’s too easy to claim surface familiarity with all sorts of robust frameworks and libraries, but the most wanted technologists are often the ones who demonstrate in-depth knowledge, say, down to the code level of a software library.

Final thoughts

Startup culture 1.0 left us a signature work style many aspire to embrace. It has evolved over the years into a more contemporary 2.0 that better suits modern software development models in various changeable competitive spaces. But it’s important that we don’t superficially take the hyped-up buzzwords at face value and mechanically apply them. The various cultural aspects should be selectively adopted and adjusted in accordance with the team’s strengths and weaknesses, project type, etc. More importantly, whatever is embraced should never be driven by distrust or insecurity.

1 thought on “Startup Culture 2.0”

  1. Sanjiva Nath
    October 4, 2016 at 2:15 am

    A good elaboration of the growth of the startup culture in the Bay Area and some recommendations towards further evolution. It is easy to relate to many of the things described here but it will still require some introspection to effect further change.


Relational Database Redemption

Relational databases, such as PostgreSQL and Oracle, can be traced back to the 80’s, when they became the dominant type of data management system. Their prominence was further secured by the ANSI standardization of the domain-specific language called SQL (Structured Query Language). Since then, the RDBMS (relational database management system) has been the de facto component around which most data-centric applications are architecturally centered.

What happened to relational Databases?

It’s a little troubling, though, that over the past 10-15 years I’ve witnessed relational databases being sidelined from the core functionality requirement review or architectural design in many software engineering projects that involve data-centric applications. In particular, other kinds of databases would often be favored for no good reason. And when relational databases were part of the core technology stack, thorough data model design would often be skipped, and use of SQL would often be avoided even in cases where it would be highly efficient.

So, why have relational databases been treated with noticeably less preference or seriousness? I believe a couple of causes have led to this phenomenon.

Object-oriented data persistence architecture

First, there was a shift in application architecture in the late 90’s, when object-oriented programming began to increasingly dominate the computing world. In particular, the backend data persistence component of object-oriented applications began to take over the heavy lifting of database CRUD (create/read/update/delete) operations, which used to reside within the database tier via SQL or the procedural language PL/SQL.

Java EJB (Enterprise JavaBeans), which aimed to provide data persistence and query functionality among other things, took the object-oriented programming world by storm. ORM (object-relational mapping) then further helped keep software engineers completely inside the Object world. After the initial EJB specifications were recognized as over-engineered, the technology evolved into JPA (Java Persistence API), which also incorporates ORM functionality. None of that eliminates the need for relational databases, but engineering design focus has since been pulled away from the database tier, and SQL has been treated as if it were irrelevant.

NoSQL databases

Then, in the late 00’s came column-oriented NoSQL databases like HBase and Cassandra, which were designed primarily to handle large-scale datasets. Built to run on scalable distributed computing platforms, these databases are great for handling Big Data at a scale where conventional relational databases would have a hard time performing well.

Meanwhile, document-based NoSQL databases like MongoDB also emerged and have increasingly been adopted by software engineers as part of the core technology stack. These NoSQL databases all of a sudden stole the spotlight in the database world. Relational databases were further perceptually “demoted”, and SQL wouldn’t look right without a negation prefix.

Object-oriented data persistence versus SQL, PL/SQL

Just to be clear, I’m not against having the data persistence layer of the application handle the business logic of data manipulations and queries within the Object world. In fact, I think it makes perfect sense to keep data access business logic within the application tier using the same object-oriented programming paradigm, shielding software engineers from having to directly deal with things in the disparate SQL world.

Another huge benefit of using object-oriented data persistence is that it takes advantage of any scaling mechanism provided by the application servers (especially for those on distributed computing platforms), rather than, say, relying entirely on database-resident PL/SQL procedures that don’t scale well.

What I’m against, though, is skipping proper design and usage best practices when a relational database is used, on the hallucination that the ORM will just magically handle all the data manipulations/queries of a blob of poorly structured data. In addition, while ORMs can automatically generate SQL for a relatively simple data model, they aren’t good at coming up with optimally efficient SQL for many sophisticated models in the real world.

NoSQL databases versus Relational databases

Another clarification point I should raise: I love both SQL-based relational and NoSQL databases, and have adopted them as core parts of different systems in the past. I believe they each have their own sweet spots as well as drawbacks, and should be adopted in accordance with the specific needs in data persistence and consumption.

I’ve seen some engineering organizations flocking to the NoSQL world for valid reasons, and others just for looking cool. I’ve also seen on a couple of occasions that companies decided to roll back from a NoSQL platform to relational databases to better address their database transaction needs, after realizing that their increasing data volume could actually be handled fine by a properly designed relational database system.

In general, if your database needs lean towards data warehousing and the projected data volume is huge, NoSQL is probably a great choice; otherwise, sticking with relational databases might be the best deal. It all boils down to the specific business requirements, and these days it’s also common for both database types to be adopted simultaneously to complement each other. As to what’s considered huge, I would say a NoSQL database solution is warranted when one or more tables need to house hundreds of millions or more rows of data.

Why do relational databases still matter?

The answer to whether relational databases still matter is a decisive yes:

  1. Real-world need of relational data models — A good portion of structured and inter-related data in the real world is still best represented by relational data models. While column-oriented databases excel in handling very large datasets, they aren’t designed for modeling relational data entities.
  2. Transactional CRUD operations — Partly due to NoSQL databases’ fundamental design, data often needs to be stored in denormalized form for performance, which makes transactional operations difficult. On the contrary, the relational database is a much more suitable model for the transactional CRUD operations that many types of applications require. That, coupled with the standard SQL language for transactional CRUD, makes the role of relational databases not easily replaceable.
  3. Bulk data manipulations — Besides being a proven, versatile tool for handling transactional CRUD, SQL also excels at manipulating data in bulk without compromising atomicity. While PL/SQL isn’t suitable for all kinds of data manipulation tasks, when used with caution it provides procedural functionality for bulk data processing or complex ETL (extract-transform-load).
  4. Improved server hardware — Improvements in server processing power and the low cost of memory and storage in recent years have helped relational databases cope with the increasing demand for high data volume. On top of that, prominent database systems are equipped with robust data sharding and clustering features that also decidedly help with scalability. Relational databases with tens or even hundreds of millions of rows of data in a table aren’t uncommon these days.

Missing skills from today’s software architects

In recent years, I’ve encountered quite a few senior software engineers/architects with advanced programming skills but poor relational data modeling/SQL knowledge. With their computing background I believe many of these engineers could pick up the essential knowledge without too much effort. (That being said, I should add that while commanding the relational database fundamentals is rather trivial, becoming a database guru does require some decent effort.) It’s primarily the lack of drive to sharpen their skills in this specific domain that has led to the said phenomenon.

The task of database design still largely falls on the shoulders of the software architect. Most database administrators can configure database systems and fine-tune queries at the operational level to ensure the databases run optimally, but few possess business requirement knowledge or, in many cases, skills for database design. Suitable database design and data modeling require intimate knowledge and understanding of the business logic of the entire application, which is normally in the software architect’s arena.

Even in the NoSQL world of column-oriented databases, I’ve noticed that database design skills are also largely missing. Part of a NoSQL database’s signature is that data columns don’t have to be well-defined upfront and can be added later as needed. Because of that, many software architects tend to think they have the liberty to bypass proper schema design upfront. The truth is that NoSQL databases need proper schema design as well. For instance, in HBase, due to the by-design limitation of indexing, one needs to carefully lay out upfront what the row key comprises and what column families will be maintained.

Old and monolithic?

Aside from causes related to the disruptive technologies described above, some misconceptions that associate relational databases with obsolete technology or monolithic design have also contributed to the unwarranted negative attitude towards RDBMS.

Old != Obsolete — Relational database technology is old. Fundamentally it hasn’t changed in decades, whereas new computing and data persistence technology buzzwords keep popping up left and right non-stop. Given so many emerging technologies that one wants to learn all at once, old RDBMS often gets placed at the bottom of the queue. In any case, if a technology is old but continues to excel within its domain, it isn’t obsolete.

RDBMS != Monolith — Contemporary software architects have been advocating against monolithic design. In recent years, more and more applications have been designed and built as microservices with isolated autonomous services and data locality. That’s all great stuff in the ever-evolving software engineering landscape, but when people automatically categorize an application with a high-volume relational database as a monolithic system, that’s a flawed assumption.

Bottom line, as long as much of the data in the real world is still best represented in relational data models, RDBMS will have its place in the computing world.

1 thought on “Relational Database Redemption”

  1. Elsa Anderson
    April 17, 2018 at 12:36 pm

    You make a great point that relational databases are needed because the real world is best represented by relational models. This is a large benefit that I will definitely tell my dad about because his business is thinking about using a database for future sales. Also, it makes sense that relational databases can store hundreds of millions of rows of data which would definitely give my dad peace of mind that he won’t run out of space for his business data.


PostgreSQL Table Partitioning

With the ever growing demand for data science work in recent years, PostgreSQL has gained tremendous popularity, especially in areas where extensive geospatial/GIS (geographic information system) functionality is needed. In a previous startup venture, MySQL was initially adopted, and I went through the trouble of migrating to PostgreSQL mainly because of the sophisticated geospatial features PostGIS offers.

PostgreSQL offers a lot of goodies, although it does have a few things that I wish were done differently. Most notable to me is that while its SELECT statement supports the SQL-92 Standard’s JOIN syntax, its UPDATE statement does not. For instance, the following UPDATE statement would not work in PostgreSQL:
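A sketch of what this looks like, with hypothetical orders/customers tables (MySQL accepts the JOIN form; PostgreSQL expects the joined table in a FROM clause instead):

```sql
-- ANSI-style JOIN inside UPDATE: accepted by MySQL, rejected by PostgreSQL
UPDATE orders o
  JOIN customers c ON c.id = o.customer_id
  SET o.status = 'vip'
  WHERE c.tier = 'gold';

-- The PostgreSQL way: put the joined table in a FROM clause
UPDATE orders o
  SET status = 'vip'
  FROM customers c
  WHERE c.id = o.customer_id
    AND c.tier = 'gold';
```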

Partial indexing

Nevertheless, for general performance and scalability, PostgreSQL remains one of the top candidates with proven track record in the world of open source RDBMS. In scaling up a PostgreSQL database, there is a wide variety of approaches. Suitable indexing is probably one of the first things to look into. Aside from planning out proper column orders in indexes that are optimal for the frequently used queries, there is another indexing feature that PostgreSQL provides for handling large datasets.

Partial indexing allows an index to be built over a subset of a table based on a conditional expression. For instance:
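A minimal sketch, again with a hypothetical orders table where most queries target only active orders:

```sql
-- Index only the subset of rows most queries actually touch
CREATE INDEX idx_orders_active
  ON orders (customer_id, order_date)
  WHERE status = 'active';
```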

In the case of a table with a large number of rows, this feature could make an otherwise gigantic index much smaller, and thus more efficient for queries against the selectively indexed data.

Scaling up with table partitioning

However, when a table grows to a certain volume, say, beyond a couple hundred million rows, and periodically archiving data off the table isn’t an option, performance may still be a problem even with an applicable indexing strategy. In many cases, it might be necessary to do something directly with the table structure, and table partitioning is often a good solution.

There are a few approaches to partition a PostgreSQL table. Among them, partitioning by means of table inheritance is perhaps the most popular approach. A master table will be created as a template that defines the table structure. This master table will be empty whereas a number of child tables inherited from this master table will actually host the data.

The partitioning is based on a partition key which can be a column or a combination of columns. In some common use cases, the partition keys are often date-time related. For instance, a partition key could be defined in a table to partition all sales orders by months with constraint like the following:

order_date >= '2016-12-01 00:00:00' AND order_date < '2017-01-01 00:00:00'

Other common cases include partitioning geographically, etc.

A table partitioning example

When I was with a real estate startup building an application that involved over 100 million nationwide properties, each with multiple attributes of interest, table partitioning was employed to address the demanding data volume. Below is a simplified example of how the property sale transaction table was partitioned to maintain a billion rows of data.

First, create the master table which will serve as the template for the table structure.
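A simplified sketch of the master table (column names and types are assumptions, not the original schema):

```sql
-- Master table: serves only as the structural template; it will hold no data
CREATE TABLE property_sale (
  id          BIGSERIAL,
  state       CHAR(2)       NOT NULL,   -- partition key
  property_id BIGINT        NOT NULL,
  sale_date   DATE          NOT NULL,
  sale_price  NUMERIC(12,2) NOT NULL
);
```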

Next, create child tables inheriting from the master table for the individual states. For simplicity, I only set up 24 states for performance evaluation.
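Each child table inherits the structure and carries a CHECK constraint that pins it to one partition key value, along these lines:

```sql
-- One child table per state, each constrained to its own key value
CREATE TABLE property_sale_ca (CHECK (state = 'CA')) INHERITS (property_sale);
CREATE TABLE property_sale_tx (CHECK (state = 'TX')) INHERITS (property_sale);
CREATE TABLE property_sale_ny (CHECK (state = 'NY')) INHERITS (property_sale);
-- ... repeat for the remaining states ...
```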

Nothing magical so far, until a suitable trigger for propagating inserts is put in place. The trigger essentially redirects insert requests against the master table to the corresponding child tables.
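A sketch of such a trigger in PL/pgSQL (function and trigger names are assumptions; it derives the child table name from the state column):

```sql
-- Route inserts on the master table to the matching child table
CREATE OR REPLACE FUNCTION property_sale_insert_fn() RETURNS TRIGGER AS $$
BEGIN
  EXECUTE format('INSERT INTO property_sale_%s VALUES ($1.*)',
                 lower(NEW.state)) USING NEW;
  RETURN NULL;  -- suppress the insert into the empty master table
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER property_sale_insert_trg
  BEFORE INSERT ON property_sale
  FOR EACH ROW EXECUTE PROCEDURE property_sale_insert_fn();
```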

Let’s test inserting data into the partitioned tables via the trigger:
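With the sketched schema above, a quick sanity check looks like this; the master table itself should stay empty while rows land in the matching child tables:

```sql
INSERT INTO property_sale (state, property_id, sale_date, sale_price)
  VALUES ('CA', 10000001, '2016-06-15', 750000),
         ('TX', 10000002, '2016-07-01', 450000);

SELECT count(*) FROM ONLY property_sale;  -- the master table holds no rows
SELECT count(*) FROM property_sale_ca;    -- the CA row was routed here
SELECT count(*) FROM property_sale;       -- querying the master sees all children
```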

A Python program for data import

Now that the master table and its child tables are functionally in place, we’re going to populate them with large-scale data for testing. First, write a simple program using Python (or any other programming/scripting language) as follows to generate simulated data in a tab-delimited file for data import:
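A simplified stand-in for the generator program (the original isn’t shown here, so the column layout and names are assumptions matching the sketched schema):

```python
import random

# Subset of states for illustration; the real run covered 24
STATES = ["CA", "TX", "NY", "FL", "WA", "IL", "PA", "OH"]

def gen_row(property_id: int) -> str:
    """Generate one tab-delimited property sale row."""
    state = random.choice(STATES)
    sale_date = "%04d-%02d-%02d" % (
        random.randint(2000, 2016), random.randint(1, 12), random.randint(1, 28))
    sale_price = random.randint(50, 5000) * 1000
    return "\t".join([state, str(property_id), sale_date, str(sale_price)])

def gen_infile(path: str, num_rows: int) -> None:
    """Write num_rows simulated rows to a tab-delimited infile."""
    with open(path, "w") as f:
        for pid in range(1, num_rows + 1):
            f.write(gen_row(pid) + "\n")

if __name__ == "__main__":
    # Scale num_rows up toward 1 billion for the real test
    gen_infile("property_sale_infile.tsv", 1000)
```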

Run the Python program to generate up to 1 billion rows of property sale data. Given the rather huge output, make sure the generated file is on a storage device with plenty of space. Since it’s going to take some time to finish, the task is best run in the background, perhaps with mail notification, like the following:
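Something along these lines (script and file names are hypothetical, and it assumes the `mail` command is configured on the host):

```shell
# Generate the infile in the background; email a notification when done
nohup sh -c 'python3 gen_property_sale.py && \
  echo "infile ready" | mail -s "data generation finished" dev@example.com' \
  > gen.log 2>&1 &
```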

Next, load data from the generated infile into the partitioned tables using psql. In case there are indexes created for the partitioned tables, it would generally be much more efficient to first drop them and recreate them after loading the data, like in the following:
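Inside psql, that boils down to something like the following (index and file names are illustrative):

```sql
-- Drop indexes, bulk-load with \copy, then recreate the indexes
DROP INDEX IF EXISTS idx_property_sale_ca_date;

\copy property_sale FROM 'property_sale_infile.tsv'

CREATE INDEX idx_property_sale_ca_date ON property_sale_ca (sale_date);
```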

Query with Constraint Exclusion

Prior to querying the tables, make sure the query optimization parameter, constraint_exclusion, is enabled.
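For example:

```sql
SHOW constraint_exclusion;
SET constraint_exclusion = partition;  -- or 'on' to check all queries

-- With a constant predicate on the partition key, the plan should show
-- only property_sale_ca being scanned; the other children are excluded
EXPLAIN SELECT count(*) FROM property_sale WHERE state = 'CA';
```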

With constraint exclusion enabled, the query planner will be smart enough to examine query constraints and exclude scanning of those partitioned tables that don’t match them. Unfortunately, though, if the constraints involve matching against non-constants like the NOW() function, the query planner won’t have enough information to filter out unwanted partitions, hence won’t be able to take advantage of the optimization.

Final notes

With a suitable partitioning scheme applied to a big table, query performance can be improved by an order of magnitude. As illustrated in the above case, the entire partitioning scheme centers around the key column used for partitioning, hence it’s critical to properly plan out which key column (or combination of columns) to partition on. The number of partitions should also be carefully thought out, as too few partitions might not help, whereas too many would create too much overhead.

1 thought on “PostgreSQL Table Partitioning”

  1. Pingback: Streaming ETL With Alpakka Kafka | Genuine Blog


Text Mining With Akka Streams

Reactive Systems, whose core characteristics are declared in the Reactive Manifesto, have started to emerge in recent years as message-driven systems that emphasize scalability, responsiveness and resilience. It’s pretty clear from the requirements that a system can’t simply be made Reactive. Rather, it should be built to be Reactive from the architectural level up.

Akka’s actor systems, which rely on asynchronous message-passing among lightweight, loosely coupled actors, serve as a great run-time platform for building Reactive Systems on the JVM (Java Virtual Machine). I’ve posted a few blogs along with sample code about Akka actors in the past. This time I’m going to talk about something different but closely related.

Reactive Streams

While bearing a similar name, Reactive Streams is a separate initiative that requires its implementations to be capable of processing stream data asynchronously while automatically regulating the stream flows in a non-blocking fashion.

Akka Streams, built on top of Akka actor systems, is an implementation of Reactive Streams. Equipped with the back-pressure functionality, it eliminates the need of manually buffering stream flows or custom-building stream buffering mechanism to avoid buffer overflow problems.

Extracting n-grams from text

In text mining, n-grams are useful data in the area of NLP (natural language processing). In this blog post, I’ll illustrate extracting n-grams from a stream of text messages using Akka Streams with Scala as the programming language.

First thing first, let’s create an object with methods for generating random text content:
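A minimal sketch of what such an object could look like (the letter pools, punctuation choice and method names are assumptions, not the original TextMessage.scala):

```scala
import scala.util.Random

object TextMessage {
  private val vowels = "aeiou"
  private val consonants = "bcdfghjklmnprstvw"

  // Alternate consonants and vowels to form a likely-pronounceable fake word
  def genWord(len: Int): String =
    (0 until len).map { i =>
      val pool = if (i % 2 == 0) consonants else vowels
      pool(Random.nextInt(pool.length))
    }.mkString

  // A clause of `numWords` words, each up to `maxWordLen` letters long,
  // ended with a random punctuation mark
  def genClause(numWords: Int, maxWordLen: Int): String = {
    val words = Seq.fill(numWords)(genWord(2 + Random.nextInt(maxWordLen - 1)))
    words.mkString(" ") + ".,!?"(Random.nextInt(4))
  }
}
```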

Source code: TextMessage.scala

Some minimal effort has been made to generate random clauses of likely-pronounceable fake words along with punctuation. To make it a little more flexible, the lengths of individual words and clauses are supplied as parameters.

Next, create another object with text processing methods responsible for extracting n-grams from input text, with n being an input parameter. Using Scala’s sliding(size, step) iterator method with size n and step defaulting to 1, a new iterator of sliding window views is generated to produce the wanted n-grams.
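A sketch of the idea (method and object names are assumptions): strip punctuation, split into words, then let sliding(n) emit the n-grams.

```scala
object TextProcessor {
  // Extract n-grams from a text string; step of sliding() defaults to 1
  def ngrams(text: String, n: Int): Iterator[String] =
    text.toLowerCase
      .replaceAll("""[\p{Punct}]""", "")
      .split("\\s+")
      .filter(_.nonEmpty)
      .sliding(n)
      .map(_.mkString(" "))
}
```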

Source code: TextProcessor.scala

Now that the text processing tools are in place, we can focus on building the main streaming application in which Akka Streams plays the key role.

First, make sure we have the necessary library dependencies included in build.sbt:
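Something along these lines; the exact versions below are illustrative of the Akka 2.4.x era rather than taken from the original build.sbt:

```scala
name := "ngram-stream"

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "com.typesafe.akka" %% "akka-actor"  % "2.4.16",
  "com.typesafe.akka" %% "akka-stream" % "2.4.16"
)
```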

Source code: build.sbt
源代码:build.sbt

As Akka Streams is relatively new development work, more recent Akka versions (2.4.9 or higher) should be used.

Let’s start with a simple stream for this text mining application:

Source code: NgramStream_v01.scala

As shown in the source code, constructing a simple stream like this is just a matter of defining and chaining together the text-generating source, the text-processing flow and the text-display sink, as follows:
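A sketch of that chaining, reusing the TextMessage/TextProcessor names from above (component names and sizes are assumptions):

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}

object NgramStream extends App {
  implicit val system = ActorSystem("ngram-stream")
  implicit val materializer = ActorMaterializer()

  // Source of randomly generated text clauses
  val textSource: Source[String, NotUsed] =
    Source(1 to 100).map(_ => TextMessage.genClause(8, 6))

  // Flow that expands each clause into its 3-grams
  val ngramFlow: Flow[String, String, NotUsed] =
    Flow[String].mapConcat(text => TextProcessor.ngrams(text, 3).toList)

  // Sink that prints each n-gram to the console
  val consoleSink = Sink.foreach[String](println)

  textSource.via(ngramFlow).to(consoleSink).run()
}
```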

Graph DSL

Akka Streams provides a Graph DSL (domain-specific language) that helps build the topology of stream flows using predefined fan-in/fan-out functions.

What Graph DSL does is somewhat similar to how Apache Storm‘s TopologyBuilder pieces together its spouts (i.e. stream sources), bolts (i.e. stream processors) and stream grouping/partitioning functions, as illustrated in a previous blog about HBase streaming.

Back-pressure

Now, let’s branch off the stream using Graph DSL to illustrate how the integral back-pressure feature is at play.

Source code: NgramStream_v02.scala

Streaming to a file should be significantly slower than streaming to the console. To make the difference more noticeable, a delay is deliberately added to streaming each line of text in the file sink.
流式传输到文件应该比流式传输到控制台慢得多。为了使差异更加明显,特意在流式传输文件接收器中的每一行文本时添加了延迟。

Run the application and you will notice that the console display is slowed down. This is the result of the upstream data flow being regulated to accommodate the relatively slow file I/O outlet, even though the other console outlet is able to consume relatively faster – all conducted in a non-blocking fashion.

Graph DSL create() methods

To build a streaming topology using Graph DSL, you’ll need to use one of the create() methods defined within trait GraphApply, which is extended by object GraphDSL. Here are the signatures of the create() methods:
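Approximately, and abridged (see the Akka source of trait GraphApply for the exact definitions; sbt-boilerplate expands the higher-arity variants):

```scala
def create[S <: Shape]()(
    buildBlock: GraphDSL.Builder[NotUsed] => S): Graph[S, NotUsed]

def create[S <: Shape, Mat](g1: Graph[Shape, Mat])(
    buildBlock: GraphDSL.Builder[Mat] => g1.Shape => S): Graph[S, Mat]

def create[S <: Shape, Mat, M1, M2](g1: Graph[Shape, M1], g2: Graph[Shape, M2])(
    combineMat: (M1, M2) => Mat)(
    buildBlock: GraphDSL.Builder[Mat] => (g1.Shape, g2.Shape) => S): Graph[S, Mat]
```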

Note that the sbt-boilerplate template language is used to generate the create() method overloads that take multiple stream components as input parameters.

Materialized values

In Akka Streams, materializing a constructed stream is the step of actually running the stream with the necessary resources. To run the stream, the implicitly passed factory method ActorMaterializer() is required to allocate the resources for stream execution. That includes starting up the underlying Akka actors to process the stream.

Every processing stage of the stream can produce a materialized value. By default, using the via(flow) and to(sink) functions, the materialized value of the left-most stage will be preserved. As in the following example, for graph1, the materialized value of the source is preserved:
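A sketch, reusing the component names from the earlier snippets:

```scala
import akka.stream.scaladsl.Keep

val graph1 = textSource.via(ngramFlow).to(consoleSink)                // keeps source's Mat
val graph2 = textSource.viaMat(ngramFlow)(Keep.right).to(consoleSink) // keeps flow's Mat
val graph3 = textSource
  .viaMat(ngramFlow)(Keep.right)
  .toMat(consoleSink)(Keep.both)                                      // keeps flow's and sink's
```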

To allow one to selectively capture the materialized values of specific stream components, Akka Streams provides the functions viaMat(flow) and toMat(sink), along with the combiner function Keep. As shown in the above example, for graph2, the materialized value of the flow is preserved, whereas for graph3, the materialized values of both the flow and the sink are preserved.

Back to our fileSink function as listed below, toMat(fileIOSink)(Keep.right) instructs Akka Streams to keep the materialized value of the fileIOSink as a Future value of type IOResult:
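A sketch of that fileSink function (names are assumptions; the per-line delay is the deliberate slow-down mentioned earlier):

```scala
import java.nio.file.Paths
import scala.concurrent.Future
import akka.stream.IOResult
import akka.stream.scaladsl.{FileIO, Flow, Keep, Sink}
import akka.util.ByteString

def fileSink(filename: String): Sink[String, Future[IOResult]] = {
  val fileIOSink = FileIO.toPath(Paths.get(filename))
  Flow[String]
    .map { line => Thread.sleep(10); ByteString(line + "\n") }
    .toMat(fileIOSink)(Keep.right)  // keep the sink's Future[IOResult]
}
```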

Using Graph DSL, as seen earlier in the signature of the create() method, one can select what materialized value is to be preserved by specifying the associated stream components accordingly as the curried parameters:

In our case, we want the materialized value of fileSink, thus the curried parameters should look like this:
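In sketch form, passing fileSink as the curried graph component makes its Future[IOResult] the materialized value of the whole graph (file name is illustrative):

```scala
import akka.stream.ClosedShape
import akka.stream.scaladsl.{GraphDSL, RunnableGraph}

val g = RunnableGraph.fromGraph(
  GraphDSL.create(fileSink("ngrams.out")) { implicit builder => fSink =>
    import GraphDSL.Implicits._
    textSource ~> ngramFlow ~> fSink
    ClosedShape
  }
)
// g.run() materializes the graph and yields fileSink's Future[IOResult]
```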

Defining the stream graph

Akka Streams provides a number of functions for fan-out (e.g. Broadcast, Balance) and fan-in (e.g. Merge, Concat). In our example, we want a simple topology with a single text source and the same n-gram generator flow branching off to two sinks in parallel:
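Such a topology can be sketched with the Graph DSL as below. The fileSink function is the one described earlier in the post; textSource and ngramFlow are stand-ins for the actual text source and n-gram generator. Passing fileSink(...) as the curried parameter of create() preserves its materialized value:

```scala
import akka.NotUsed
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, RunnableGraph, Sink, Source}

val textSource: Source[String, NotUsed]      = Source(List("line one", "line two"))
val ngramFlow:  Flow[String, String, NotUsed] = Flow[String]  // stand-in for the n-gram generator

val graph = RunnableGraph.fromGraph(GraphDSL.create(fileSink("ngrams.txt")) {
  implicit builder => fileSnk =>
    import GraphDSL.Implicits._
    val bcast = builder.add(Broadcast[String](2))
    textSource ~> ngramFlow ~> bcast
    bcast.out(0) ~> Sink.foreach(println)   // first sink: print to console
    bcast.out(1) ~> fileSnk                 // second sink: write to file
    ClosedShape
})
// graph.run() yields the Future[IOResult] materialized by fileSink
```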

Adding a message counter

Let’s further expand our n-gram extraction application to include displaying a count. A simple count-flow is created to map each message string into numeric 1, and a count-sink to sum up all these 1’s streamed to the sink. Adding them as the third flow and sink to the existing stream topology yields something similar to the following:
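The count-flow and count-sink might look like the following sketch, with the Broadcast widened to three outlets so the new branch runs alongside the existing two sinks:

```scala
import akka.NotUsed
import akka.stream.scaladsl.{Flow, Sink}
import scala.concurrent.Future

// Map each message string to the numeric 1 ...
val countFlow: Flow[String, Int, NotUsed] = Flow[String].map(_ => 1)

// ... and sum up all the 1's streamed to the sink
val countSink: Sink[Int, Future[Int]] = Sink.fold[Int, Int](0)(_ + _)
```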

Source code: NgramStream_v03.scala

Full source code of the application is at GitHub.

Final thoughts

Having used Apache Storm, I see it as a rather different beast compared with Akka Streams. A full comparison between the two would obviously be an extensive exercise by itself, but suffice it to say that both are great platforms for streaming applications.

Perhaps one of the biggest differences between the two is that Storm provides granular message delivery options (at most / at least / exactly once, guaranteed message delivery) whereas Akka Streams by design questions the premise of reliable messaging on distributed systems. For instance, if guaranteed message delivery is a requirement, Akka Streams would probably not be the best choice.

Back-pressure has recently been added to Storm’s v.1.0.x built-in feature list, so there is indeed some flavor of reactiveness in it. Aside from message delivery options, choosing between the two technologies might be a decision based more on other factors such as engineering staff’s expertise, concurrency model preference, etc.

Outside of the turf of typical streaming systems, Akka Streams also plays a key role as the underlying platform for an emerging service stack. Viewed as the next generation of Spray.io, Akka HTTP is built on top of Akka Streams. Designed for building HTTP-based integration layers, Akka HTTP provides versatile streaming-oriented HTTP routing and request/response transformation mechanisms. Under the hood, Akka Streams’ back-pressure functionality regulates data streaming between the server and the remote client, consequently conserving memory utilization on the server.

1 thought on “Text Mining With Akka Streams”

  1. Pingback: Akka Dynamic Pub-Sub Service | Genuine Blog


Scala IoT Systems With Akka Actors II

Back in 2016, I built an Internet-of-Things (IoT) prototype system leveraging the “minimalist” design principle of the Actor model to simulate low-cost, low-powered IoT devices. A simplified version of the prototype was published in a previous blog post. The stripped-down application was written in Scala along with the Akka Actors run-time library, which is arguably the predominant Actor model implementation at present. Message Queue Telemetry Transport (MQTT) was used as the publish-subscribe messaging protocol for the simulated IoT devices. For simplicity, a single actor was used to simulate requests from a bunch of IoT devices.

In this blog post, I would like to share a version closer to the design of the full prototype system. With the same tech stack used in the previous application, it’s an expanded version (hence, II) that uses loosely-coupled lightweight actors to simulate individual IoT devices, each of which maintains its own internal state and handles bidirectional communications via non-blocking message passing. Using a distributed workers system adapted from a Lightbend template along with a persistence journal, the end product is an IoT system equipped with a scalable fault-tolerant data processing system.

Main components

Below is a diagram and a summary of the revised Scala application which consists of 3 main components:

IoT with MQTT and Akka Actor Systems v.2

1. IoT

  • An IotManager actor which:
    • instantiates a specified number of devices upon start-up
    • subscribes to a MQTT pub-sub topic for the work requests
    • sends received work requests via ClusterClient to the master cluster
    • notifies Device actors upon receiving failure messages from Master actor
    • forwards work results to the corresponding devices upon receiving them from ResultProcessor
  • Device actors each of which:
    • simulates a thermostat, lamp, or security alarm with random initial state and setting
    • maintains and updates internal state and setting upon receiving work results from IotManager
    • generates work requests and publishes them to the MQTT pub-sub topic
    • re-publishes requests upon receiving failure messages from IotManager
  • A MQTT pub-sub broker and a MQTT client for communicating with the broker
  • A configuration helper object, MqttConfig, consisting of:
    • MQTT pub-sub topic
    • URL for the MQTT broker
    • serialization methods to convert objects to byte arrays, and vice versa

2. Master Cluster

  • A fault-tolerant decentralized cluster which:
    • manages a singleton actor instance among the cluster nodes (with a specified role)
    • delegates ClusterClientReceptionist on every node to answer external connection requests
    • provides fail-over of the singleton actor to the next-oldest node in the cluster
  • A Master singleton actor which:
    • registers Workers and distributes work to available Workers
    • acknowledges work request reception with IotManager
    • publishes work results from Workers to ‘work-results’ topic via Akka distributed pub-sub
    • maintains work states using persistence journal
  • A ResultProcessor actor in the master cluster which:
    • gets instantiated upon starting up the IoT system (more on this below)
    • consumes work results by subscribing to the ‘work-results’ topic
    • sends work results received from Master to IotManager

3. Workers

  • An actor system of Workers each of which:
    • communicates via ClusterClient with the master cluster
    • registers with, pulls work from the Master actor
    • reports work status with the Master actor
    • instantiates a WorkProcessor actor to perform the actual work
  • WorkProcessor actors each of which:
    • processes the work requests from its parent Worker
    • generates work results and send back to Worker

Master-worker system with a ‘pull’ model

While significant changes have been made to the IoT actor system, much of the setup for the Master/Worker actor systems and MQTT pub-sub messaging remains largely unchanged from the previous version:

  • As separate independent actor systems, both the IoT and Worker systems communicate with the Master cluster via ClusterClient.
  • Using a ‘pull’ model which generally performs better at scale, the Worker actors register with the Master cluster and pull work when available.
  • Paho-Akka is used as the MQTT pub-sub messaging client.
  • A helper object, MqttConfig, encapsulates a MQTT pub-sub topic and broker information along with serialization methods to handle MQTT messaging using a test Mosquitto broker.
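The serialization helpers in MqttConfig might be sketched as below. The topic and broker values are placeholders, and plain Java serialization is assumed here for illustration only:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Sketch of the MqttConfig helper object; values and mechanism are assumptions
object MqttConfig {
  val topic  = "work-requests"                      // placeholder pub-sub topic
  val broker = "tcp://test.mosquitto.org:1883"      // test Mosquitto broker URL

  // Serialize an object into a byte array for MQTT payloads
  def writeToByteArray(obj: Any): Array[Byte] = {
    val baos = new ByteArrayOutputStream()
    val oos  = new ObjectOutputStream(baos)
    try { oos.writeObject(obj); baos.toByteArray } finally oos.close()
  }

  // Deserialize a byte array back into an object
  def readFromByteArray[T](bytes: Array[Byte]): T = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try ois.readObject().asInstanceOf[T] finally ois.close()
  }
}
```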

What’s new?

Now, let’s look at the major changes in the revised application:

First of all, Lightbend’s Activator has been retired, so sbt is used instead.

On persisting actors state, a Redis data store is used as the persistence journal. In the previous version the shared LevelDB journal is coupled with the first seed node which becomes a single point of failure. With the Redis persistence journal decoupled from a specific cluster node, fault tolerance steps up a notch.

As mentioned earlier in the post, one of the key changes from the previous application is the use of actors to represent individual IoT devices, each with its own state and the capability of communicating with entities designated for interfacing with external actor systems. Actors, lightweight and loosely coupled by design, serve as an excellent vehicle for modeling individual IoT devices. In addition, non-blocking message passing among actors provides an efficient and economical means for communication and logic control of the device state.

The IotManager actor is responsible for creating and managing a specified number of Device actors. Upon startup, the IoT manager instantiates individual Device actors of random device type (thermostat, lamp or security alarm). These devices are maintained in an internal registry regularly updated by the IoT manager.

Each of the Device actors starts up with a random state and setting. For instance, a thermostat device may start with an ON state and a temperature setting of 68F whereas a lamp device might have an initial state of OFF and brightness setting of 2. Once instantiated, a Device actor will maintain its internal operational state and setting from then on and will report and update the state and setting per request.

Work and WorkResult

In this application, a Work object represents a request sent by a specific Device actor and carries the Device’s Id and its current state and setting data. A WorkResult object, on the other hand, represents a returned request for the Device actor to update its state and setting stored within the object.
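The two message types can be pictured as simple case classes. The field names and types below are assumptions for illustration, not taken from the original code (though a String workId is consistent with the discussion in the comments):

```scala
// Hypothetical shapes of the request/response messages
case class Work(workId: String, deviceId: String, state: String, setting: Int)
case class WorkResult(workId: String, deviceId: String, state: String, setting: Int)
```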

Responsible for processing the WorkResult generated by the Worker actors, the ResultProcessor actor simulates the processing of a work result – in this case it simply sends the work result, via the actorSelection method, back to the original Device actor through IotManager. Interacting with the Master cluster system only as a cluster client, the Worker actors have no knowledge of the ResultProcessor actor. ResultProcessor receives the work results by subscribing to the Akka distributed pub-sub topic to which the Master publishes.

While a participant of the Master cluster actor system, the ResultProcessor actor gets instantiated when the IoT actor system starts up. The decoupling of ResultProcessor instantiation from the Master cluster ensures that no excessive ResultProcessor instances get started when multiple Master cluster nodes start up.

Test running the application

Complete source code of the application is available at GitHub.

To run the application on a single JVM, just git-clone the repo, run the following command at a command line terminal and observe the console output:

The optional NumOfDevices parameter defaults to 20.

To run the application on separate JVMs, git-clone the repo to a local disk, open up separate command line terminals and launch the different components on separate terminals:

Sample console log

Below is filtered console log output from the console tracing the evolving state and setting of a thermostat device:

The following annotated console log showcases fault-tolerance of the master cluster – how it fails over to the 2nd node upon detecting that the 1st node crashes:

Scaling for production

The Actor model is well suited for building scalable distributed systems. While the application has an underlying architecture that emphasizes scalability, it would require further effort in the following areas to make it production ready:

IotManager uses the ‘ask’ method for message receipt confirmation via a Future returned by the Master. If business logic allows, using the fire-and-forget ‘tell’ method will be significantly more efficient, especially at scale.
The MQTT broker used in the application is a test broker provided by Mosquitto. A production version of the broker should be installed, preferably local to the IoT system. MQTT brokers from other vendors such as HiveMQ and RabbitMQ are also available.
  • As displayed in the console log when running the application, Akka’s default Java serializer isn’t best known for its efficiency. Other serializers such as Kryo, Protocol Buffers should be considered.
  • The Redis data store for actor state persistence should be configured for production environment

Further code changes to be considered

A couple of changes to the current application might be worth considering:

Device types are currently represented as strings, and code logic for device type-specific states and settings is repeated during instantiation of devices and processing of work requests. Such logic could be encapsulated within classes defined for individual device types. The payload would probably be larger as a consequence, but it might be worthwhile for better code maintainability, especially if there are many device types.

Another change to be considered is that Work and WorkResult could be generalized into a single class. Conversely, they could be further differentiated in accordance with specific business needs. A slightly more extensive change would be to retire ResultProcessor altogether and let Worker actors process WorkResult as well.

State mutation in Akka Actors

In this application, a few actors maintain mutable internal states using private variables (private var):

  • Master
  • IotManager
  • Device

Since an actor by design will never be accessed by multiple threads concurrently, it’s generally safe enough to use a ‘private var’ to store changed states. But if one prefers state transitioning (as opposed to updating), Akka Actors provides a method to hot-swap an actor’s internal state.

Hot-swapping an actor’s state

Below is a sample snippet that illustrates how hot-swapping mimics a state machine without having to use any mutable variable for maintaining the actor state:
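The sketch below is adapted for illustration; the message types and master protocol are assumptions, not the actual application code:

```scala
import akka.actor.{Actor, ActorRef}

// Hypothetical message types for the sketch
case object WorkIsReady
case class Work(workId: String)
case class WorkDone(workId: String)

class Worker(master: ActorRef) extends Actor {
  def receive: Receive = idle

  def idle: Receive = {
    case WorkIsReady =>
      master ! "pull-work"       // pull work from the master
      context.become(busy)       // hot-swap to the 'busy' behavior
  }

  def busy: Receive = {
    case WorkDone(id) =>
      master ! s"completed: $id"
      context.become(idle)       // swap back to 'idle'
  }
}
```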

Simplified for illustration, the above snippet depicts a Worker actor that pulls work from the Master cluster. The context.become method allows the actor to switch its internal state at run-time like a state machine. As shown in the simplified code, it takes an ‘Actor.Receive’ (which is a partial function) that implements a new message handler. Under the hood, Akka manages the hot-swapping via a stack. As a side note, according to the relevant source code, the stack for hot-swapping actor behavior is, ironically, a mutable ‘private var’ of List[Actor.Receive].

Recursive transformation of immutable parameter

Another functional approach to mutating actor state is via recursive transformation of an immutable parameter. As an example, we can avoid using a mutable ‘private var registry’ as shown in the following ActorManager actor and use ‘context.become’ to recursively transform a registry as an immutable parameter passed to the updateState method:
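A minimal sketch of that approach (the message protocol here is assumed for illustration):

```scala
import akka.actor.{Actor, ActorRef}

// Maintain a registry without any mutable 'private var': each update
// recursively re-installs the behavior with a transformed immutable Map
class ActorManager extends Actor {
  def receive: Receive = updateState(Map.empty[String, ActorRef])

  def updateState(registry: Map[String, ActorRef]): Receive = {
    case ("register", id: String) =>
      context.become(updateState(registry + (id -> sender())))
    case ("deregister", id: String) =>
      context.become(updateState(registry - id))
  }
}
```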

9 thoughts on “Scala IoT Systems With Akka Actors II”

  1. Alan, November 6, 2017 at 1:32 pm

    On the Worker.scala class you need to substitute:

    val workProcessor = context.watch(context.actorOf(WorkProcessor.props(), "work-processor"))

    for:
    val workProcessor = context.watch(context.actorOf(workProcessorProps, "work-processor"))

    that way you can reuse the Worker class for other types of “work”

    Also, how can I make the Work and WorkResult Generic (Work.scala)? Whenever I add a generic type to those case classes the WorkQueue.scala file complains.

    1. Leo Cheung (post author), November 7, 2017 at 11:46 pm

      Thanks for the feedback, and for pointing out the potential problem with the use of WorkProcessor.Props(), which for some reason crept into the revised version of the Worker class.

      Re: customizing (or generalizing) Work and WorkResult, I left that to the blog readers since it would be largely governed by specific business requirement. As mentioned in the blog post, the master-worker system for work distribution was adapted from a Lightbend template (https://www.lightbend.com/activator/template/akka-distributed-workers). Minimal code change was made in adapting from the template, as the inner-working of the work distribution system isn’t the core focus of the application. Corresponding code change in WorkQueue and probably elsewhere might be inevitable when customizing Work/WorkResult. For instance, WorkQueue consists of code logic that assumes workID of String type being a class member of Work.

  2. Silva, March 7, 2018 at 1:35 pm

    Hi Cheung,
    your job was great, congratulations.

    I have an issue. Have you thought about how to make iotManager be fault tolerant and scalable, just like the Master actor, in a simple way?

  3. Leo Cheung (post author), March 7, 2018 at 3:46 pm

    Thanks for the kind words. On fault tolerance, you’re right that IoTManager/Device can be enhanced to operate like the Master actor or leverage cluster sharding (plus persistence journal) feature. It all comes down to specific business requirement.

  4. Konstantinos Chaitas, May 31, 2019 at 12:52 pm

    Hello Cheung,

    very nice work and super helpful. One question that I have is regarding the mqtt broker. As far as I understand the mqtt broker is still centralized right ? Is there any way we could make it decentralized/distributed ? I know there are distributed brokers out there e.g Vernemq, but is it possible to make a ‘centralized’ mqtt broker e.g mosquito distributed using Akka and maybe clustering/consistent hashing ? Thanks

    1. Leo Cheung (post author), June 1, 2019 at 12:15 am

      I appreciate the kind words. Yes, the Mosquitto broker used in the proof-of-concept application is not distributed. It’s worth noting that the application is supposed to be largely agnostic to the specific MQTT broker being used, thus it would require little to no code change to replace the existing MQTT broker with something else. If you want a distributed broker, I would recommend going with a by-design distributed product (VerneMQ, Mosca, EMQ, etc), rather than trying to repurpose Mosquitto into something it isn’t principally designed for.

      1. Konstantinos Chaitas, June 1, 2019 at 5:47 am

        Thanks for the fast and helpful response. I am planning to work for my Master thesis on a project which is using the Moquette MQTT broker (very similar to Mosquitto) and it creates a bottleneck in my whole system. Therefore there are mainly 2 options. 1) To replace it with a distributed MQTT broker(VerneMQ, Mosca, EMQ, etc), or 2) to make it distributed. Since the 2nd option looks more interesting and I could learn more things, I was thinking to give it a try using an Akka cluster and maybe consistent hashing to distribute the traffic to the appropriate node/mqtt broker in the cluster. Actually that’s how I found your blog. Do you think that my idea/design could work ? Can you imagine any brokers/drawbacks ? Thanks in advance

        1. Leo Cheung (post author), June 1, 2019 at 7:14 pm

          As an academic project, building a broker cluster with Mosquitto (or similar MQTT brokers) does sound like an interesting exercise, though it probably wouldn’t be equivalent to a full-feature distributed broker without extensive repurposing effort. Perhaps not exactly what you’re aiming at, this Stack Overflow Q&A might be of interest.

  5. Pingback: An Akka Actor-based Blockchain | Genuine Blog


Scala On Spark – Cumulative Pivot Sum

In a couple of recent R&D projects, I was using Apache Spark rather extensively to address some data processing needs on Hadoop clusters. Although there is an abundance of big data processing platforms these days, it didn’t take long for me to settle on Spark. One of the main reasons is that the programming language for the R&D is Scala, which is what Spark itself is written in. In particular, Spark’s inherent support for functional programming and compositional transformations on immutable data enables high performance at scale as well as readability. Other main reasons are very much in line with some of the key factors attributing to Spark’s rising popularity.

I’m starting a mini blog series on Scala-on-Spark (SoS) with each blog post demonstrating with some Scala programming example on Apache Spark. In the blog series, I’m going to illustrate how the functionality-rich SoS is able to resolve some non-trivial data processing problems with seemingly little effort. If nothing else, they are good brain-teasing programming exercises in Scala on Spark.

As the source data for the example, let’s consider a minuscule set of weather data stored in a DataFrame, which consists of the following columns:

  • Weather Station ID
  • Start Date of a half-month period
  • Temperature High (in Fahrenheit) over the period
  • Temperature Low (in Fahrenheit) over the period
  • Total Precipitation (in inches) over the period

Note that with a properly configured Spark cluster, the methods illustrated in the following example can be readily adapted to handle much more granular data at scale – e.g. down to sub-hourly weather data from tens of thousands of weather stations. It’s also worth mentioning that there can be other ways to solve the problems presented in the examples.

For illustration purposes, the following code snippets are executed in a Spark shell. The first thing is to generate a DataFrame with the said columns of sample data, which will be used as source data for this example and a couple of following ones.

In this first example, the goal is to generate a table of cumulative precipitation by weather stations in month-by-month columns. By ‘cumulative sum’, it means the monthly precipitation will be cumulated from one month over to the next one (i.e. rolling sum). In other words, if July’s precipitation is 2 inches and August’s is 1 inch, the figure for August will be 3 inches. The result should look like the following table:

First, we transform the original DataFrame to include an additional year-month column, followed by using Spark’s groupBy, pivot and agg methods to generate the pivot table.
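That transformation might look like the following sketch (column names are assumed to match the sample data; monthlyPrecipDF is the name referenced later in the text):

```scala
import org.apache.spark.sql.functions._

// Add a year-month column, then pivot monthly precipitation by station
val monthlyPrecipDF = weatherDF
  .withColumn("year_month", date_format(col("start_date"), "yyyy-MM"))
  .groupBy(col("station"))
  .pivot("year_month")
  .agg(sum(col("precipitation")))
```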

Next, we assemble a list of the year-month columns and traverse the list using method foldLeft, which is one of the most versatile Scala functions for custom iterative transformations. In this particular case, the data to be transformed by foldLeft is a tuple of (DataFrame, Double). Normally, transforming the DataFrame alone should suffice, but in this case we need an additional value to address to rolling cumulation requirement.

The tuple’s first DataFrame-type element, with monthlyPrecipDF as its initial value, will be transformed using the binary operator function specified as foldLeft’s second argument (i.e. (acc, c) => …). As for the tuple’s second Double-type element, with the first year-month as its initial value it’s for carrying the current month value over to the next iteration. The end result is a (DataFrame, Double) tuple successively transformed month-by-month.
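A simplified variant of that foldLeft is sketched below. For clarity it carries the previous column's name as the tuple's second element, rather than the value described in the text; monthlyPrecipDF is the pivot table of monthly precipitation by station:

```scala
import org.apache.spark.sql.functions.col

// Ordered list of year-month pivot columns (all columns except the station key)
val yearMonths = monthlyPrecipDF.columns.filter(_ != "station").sorted.toList

// Each month's column becomes the prior cumulative value plus its own value
val (cumulativePrecipDF, _) =
  yearMonths.tail.foldLeft((monthlyPrecipDF, yearMonths.head)) {
    case ((acc, prev), ym) =>
      (acc.withColumn(ym, col(prev) + col(ym)), ym)
  }
```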

Similar pivot aggregations can be applied to temperature high’s/low’s as well, with method sum replaced with method max/min.

Finally, we compute cumulative temperature high/low like cumulative precipitation, by replacing method sum with iterative max/min using Spark’s when-otherwise method.
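For the cumulative high, the addition in the fold is replaced by an iterative max expressed with when-otherwise. In this sketch, monthlyHighDF stands for the analogous pivot table built with agg(max(...)):

```scala
import org.apache.spark.sql.functions.{col, when}

val yearMonths = monthlyHighDF.columns.filter(_ != "station").sorted.toList

// Carry forward the running maximum month over month
val (cumulativeHighDF, _) =
  yearMonths.tail.foldLeft((monthlyHighDF, yearMonths.head)) {
    case ((acc, prev), ym) =>
      (acc.withColumn(ym, when(col(prev) > col(ym), col(prev)).otherwise(col(ym))), ym)
  }
```

The cumulative low is the same fold with the comparison reversed.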

4 thoughts on “Scala On Spark – Cumulative Pivot Sum”

  1. Pingback: Scala On Spark – Sum Over Periods | Genuine Blog

  2. Pingback: Scala On Spark – Streak | Genuine Blog

  3. Pingback: Scala On Spark – Word-pair Count | Genuine Blog

  4. Pingback: Spark – Interpolating Time Series Data | Genuine Blog


Scala On Spark – Sum Over Periods

This is another programming example in my Scala-on-Spark blog series. While it uses the same minuscule weather data created in the first example of the blog series, it can be viewed as an independent programming exercise.

In this example, we want a table of total precipitation over custom past periods by weather stations. The specific periods in this example are the previous month, previous 3 months, and all previous months. We have data from July through December, and let’s say it’s now January hence the previous month is December.
在此示例中,我们需要一个按气象站自定义的过去时间段的总降水量表。此示例中的特定期间为上个月、前 3 个月和所有前几个月。我们有 7 月至 12 月的数据,假设现在是 1 月,因此上个月是 12 月。

The result should be like this:
结果应该是这样的:

User-defined functions (UDF) will be used in this example. Spark’s UDF supplements its API by allowing the vast library of Scala (or any of the other supported languages) functions to be used. That said, a method from Spark’s API should be picked over an UDF of same functionality as the former would likely perform more optimally.
本例将使用用户定义函数(UDF)。Spark的UDF通过允许使用庞大的Scala(或任何其他支持的语言)函数库来补充其API。也就是说,应该选择Spark的API中的方法,而不是具有相同功能的UDF,因为前者可能会以最佳方式执行。

First, let’s load up the said weather data.
首先,让我们加载上述天气数据。

We first create a DataFrame of precipitation by weather station and month, each with the number of months that lag the current month.
我们首先按气象站和月份创建一个降水数据帧,每个数据帧都有滞后于当月的月数。

Next, we combine the list of months-lagged with monthly precipitation by means of a UDF to create a map column. To do that, we use Scala’s zip method within the UDF to create a list of tuples from the two input lists and convert the resulting list into a map.
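A sketch of such a UDF might look like this (the DataFrame and column names aggDF, lags, precips are illustrative assumptions):

```scala
import org.apache.spark.sql.functions.{udf, col}

// Zip the months-lagged list with the monthly-precipitation list into a
// Map keyed by months lagged.
val listsToMap = udf { (lags: Seq[Int], precips: Seq[Double]) =>
  lags.zip(precips).toMap
}

val mapDF = aggDF.withColumn("lagPrecipMap", listsToMap(col("lags"), col("precips")))
```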

Note that the map content might look different depending on when it is generated, as the months-lagged is relative to the current month when the application is run.

Using another UDF to sum precipitation counting backward from the previous months based on the number of months lagged, we create the result DataFrame.
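The aggMapValues UDF could be sketched like this; the period boundaries and column names are illustrative assumptions:

```scala
import org.apache.spark.sql.functions.{udf, col, lit}

// Sum the map values whose months-lagged key lies within [from, to].
val aggMapValues = udf { (m: Map[Int, Double], from: Int, to: Int) =>
  m.collect { case (lag, precip) if lag >= from && lag <= to => precip }.sum
}

val resultDF = mapDF.select(
  col("station"),
  aggMapValues(col("lagPrecipMap"), lit(1), lit(1)).as("prevMonth"),
  aggMapValues(col("lagPrecipMap"), lit(1), lit(3)).as("prev3Months"),
  aggMapValues(col("lagPrecipMap"), lit(1), lit(Int.MaxValue)).as("allPrevMonths")
)
```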

Again, note that the months-lagged values are relative to the current month when the application is executed, hence the months-lagged parameters for the aggMapValues UDF should be adjusted accordingly.

We can use a similar approach to come up with a table of temperature high/low over the custom periods. Below are the steps for creating the result table for temperature high.

I’ll leave creating the temperature-low result table as a programming exercise for the readers. Note that rather than calculating temperature high and low separately, one could aggregate both of them together in some of the steps with little code change. For those who are up for a slightly more challenging exercise, both temperature high and low data can in fact be transformed together in every step of the way.


Scala On Spark – Streak

This is yet another programming example in my Scala-on-Spark blog series. Again, while it starts with the same minuscule weather data used in previous examples of the series, it can be viewed as an independent programming exercise.

In this example, we’re going to create a table that shows the streaks of consecutive months with non-zero precipitation.

The result should be similar to the following:

We’ll explore using Spark’s window functions in this example. As a side note, some of the previous examples in the blog series could be resolved using window functions as well. By aggregating over partitioned sliding windows of data, Spark’s window functions readily perform certain kinds of complex aggregations which would otherwise require repetitive nested groupings. They are similar to how PostgreSQL’s window functions work.

Now, let’s load up the same old minuscule weather data.

First, create a DataFrame of precipitation by weather station and month, and filter it to consist of only months with positive precipitation.

Next, using a window function, we capture sequences of row numbers ordered by month over partitions by weather station. For each row, we then use a UDF to calculate the base date by dating back from the corresponding month of the row in accordance with the row number. As shown in the following table, these base dates help trace chunks of contiguous months back to their common base dates.

Finally, we apply another row-number window function, but this time over partitions by weather station as well as base date. This partitioning allows contiguous common base dates to generate new row numbers as the wanted streaks.
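The two window passes might be sketched as follows; the column names, the year-month string format, and the filtered DataFrame name are illustrative assumptions:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{udf, col, row_number}

val byStation = Window.partitionBy(col("station")).orderBy(col("month"))

// Date back from the row's month by its row number, so contiguous months
// share a common base date (assumes months look like "2017-07").
val baseDate = udf { (month: String, rowNum: Int) =>
  java.time.YearMonth.parse(month).minusMonths(rowNum).toString
}

val streakDF = positivePrecipDF
  .withColumn("rowNum", row_number().over(byStation))
  .withColumn("baseDate", baseDate(col("month"), col("rowNum")))
  .withColumn("streak", row_number().over(
    Window.partitionBy(col("station"), col("baseDate")).orderBy(col("month"))))
```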

Using the same logic flow, we can also generate similar streak reports for temperature high/low (e.g. streak of temperature high above 75F). I’ll leave that as an exercise for the readers.


Scala On Spark – Word-pair Count

So far, the few programming examples in the SoS (Scala on Spark) blog series have all centered around DataFrames. In this blog post, I would like to give an example of Spark’s RDD (resilient distributed dataset), which is an immutable distributed collection of data that can be processed via functional transformations (e.g. map, filter, reduce).

The main difference between the RDD and DataFrame APIs is that the former provides more granular low-level functionality whereas the latter is equipped with powerful SQL-style functions to process table-form data. Note that even though a DataFrame is in table form with named columns, the underlying JVM treats each row of the data as a generic untyped object. As a side note, Spark also supports another data abstraction called Dataset, which is a distributed collection of strongly-typed objects.

Back to the RDD world. In this programming exercise, our goal is to count the number of occurrences of every distinct pair of consecutive words in a text file. In essence, for every given distinct word in a text file, we’re going to count the number of occurrences of all distinct words following that word. As a trivial example, if the text is “I am what I am”, the result should be (i, am) = 2, (what, i) = 1, (am, what) = 1.

For illustration purpose, let’s assemble a small piece of text as follows and save it in a file, say in a Hadoop HDFS file system:

Simple word count

As a warm-up exercise, let’s perform a hello-world word count, which simply reports the count of every distinct word in a text file. Using the ‘textFile()’ method in SparkContext, which serves as the entry point for every program to access resources on a Spark cluster, we load the content from the HDFS file:

Viewed as a collection of lines (delimited by line breaks), we first use ‘flatMap’ to split each line of the text by punctuation into an array of words, then flatten the arrays. Note that ‘_.split()’ is just a Scala shorthand for ‘line => line.split()’.

Next, all words are lowercased (to disregard cases) with the transformation ‘word => word.toLowerCase’, followed by a map transformation ‘word => (word, 1)’ for tallying. Using ‘reduceByKey’, the reduction transformation ‘(total, count) => total + count’ (shorthanded as ‘(_ + _)’) for each key transforms every word into a tuple of (word, totalcount). The final sorting is just for ordering the result by count.
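The word count just described might be sketched like so; the file path and the punctuation-splitting pattern are illustrative assumptions:

```scala
val lines = sc.textFile("hdfs:///path/to/textfile")  // path is illustrative

val wordCount = lines
  .flatMap(_.split("""[\s,.;:!?]+"""))  // split each line by whitespace/punctuation
  .map(_.toLowerCase)                   // disregard cases
  .map(word => (word, 1))               // tally
  .reduceByKey(_ + _)                   // (total, count) => total + count
  .sortBy(_._2, ascending = false)      // order by count
```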

Since the dataset is small, we can ‘collect’ the result data to see the output:

On a related note, Spark’s ‘reduceByKey()’, along with a couple of other ‘xxxByKey()’ functions, is a handy tool for this kind of key-value pair transformation. Had they not been provided, one would have to do it with a little more hand-crafting work like:
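Absent reduceByKey, a hand-crafted equivalent could group the key/value pairs and sum each group, sketched here on an ordinary Scala collection:

```scala
val words = "i am what i am".split(" ").toList

// Hand-crafted tally: group the (word, 1) pairs by the word, then sum
// each group's counts.
val counts: Map[String, Int] = words
  .map(word => (word, 1))
  .groupBy { case (word, _) => word }
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
```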

Word-pair count

Now, let’s move onto the main topic of this blog post – counting distinct pairs of consecutive words:

Even though the required logic for counting word pairs is apparently more complex than that of counting individual words, the necessary transformations look only slightly different. That’s partly due to how compositions of modularized functions can make complex data transformations look seemingly simple in a functional programming language like Scala. Another key factor in this case is the availability of the powerful ‘sliding(n)’ function, which transforms a collection of elements into sliding windows, each in the form of an array of size ‘n’. For example, applying sliding(2) to a sequence of words “apples”, “and”, “oranges” would result in Array(“apples”, “and”) and Array(“and”, “oranges”).

Scanning through the compositional functions, the split by punctuation and the lowercasing do exactly the same thing as in the hello-world word count case. Next, ‘sliding(2)’ generates sliding windows of word pairs, each stored in an array. The subsequent ‘map’ transforms each of the word-pair arrays into a key/value tuple, with the word-pair tuple being the key and 1 being the count value.

Similar to the reduction transformation in the hello-world word count case, ‘reduceByKey()’ generates a count for each word pair. The result is then sorted by count, then by the first and second words of the pair. Output of the word-pair count using ‘collect’ is as follows:
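The word-pair pipeline might be sketched as follows; the import of MLlib’s RDDFunctions is what makes sliding() available on RDDs, and the path and split pattern are illustrative:

```scala
import org.apache.spark.mllib.rdd.RDDFunctions._  // enables sliding() on RDDs

val wordPairCount = sc.textFile("hdfs:///path/to/textfile")
  .flatMap(_.split("""[\s,.;:!?]+"""))
  .map(_.toLowerCase)
  .sliding(2)                                   // windows of consecutive word pairs
  .map { case Array(w1, w2) => ((w1, w2), 1) }
  .reduceByKey(_ + _)
  .sortBy { case ((w1, w2), count) => (-count, w1, w2) }  // by count, then words
```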

Creating a word-pair count method

The above word-pair counting snippet can be repurposed to serve as a general method for counting a specific word-pair in a text file:

It’s worth noting that Scala’s collect method (not to be confused with Spark’s RDD ‘collect’ method) has now replaced method ‘map’ in the previous snippet. That’s because we’re now interested in counting only the specific word pair word1 and word2, thus requiring the inherent filtering functionality of method ‘collect’. Also note that in the ‘case’ statement the pair of words are enclosed in backticks to refer to the passed-in words, rather than to arbitrary pattern-matching variables.
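Such a method might be sketched as below; the method name countWordPair is an illustrative assumption:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.rdd.RDDFunctions._  // enables sliding() on RDDs

// Count occurrences of the specific consecutive pair (word1, word2).
def countWordPair(word1: String, word2: String, filePath: String)
                 (implicit sc: SparkContext): Long =
  sc.textFile(filePath)
    .flatMap(_.split("""[\s,.;:!?]+"""))
    .map(_.toLowerCase)
    .sliding(2)
    .collect { case Array(`word1`, `word2`) => 1 }  // backticks match the passed-in words
    .count()
```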

To use the word-pair count method, simply provide the pair of consecutive words and the file path as parameters, along with the SparkContext to be passed in an implicit parameter. For example:

5 thoughts on “Scala On Spark – Word-pair Count”

  1. Claudia Alves (July 27, 2018 at 1:48 am)

     Great post. It definitely has increased my knowledge on Spark. Please keep sharing similar write ups of yours. You can check this too for Spark tutorial as i have recorded this recently on Spark. and i’m sure it will be helpful to you. https://www.youtube.com/watch?v=8Kcu63H0d8c

  2. Russell (February 4, 2019 at 6:38 am)

     Very nicely explained. Got to save this site in my reading list

  3. vijay (December 6, 2020 at 8:49 am)

     Sliding doesn’t seem to be working for me, even other window operations are not working

     1. Leo Cheung, post author (December 6, 2020 at 1:02 pm)

        To use sliding for RDDs, you’ll need to import RDDFunctions from MLlib as included upfront in the sample code:

        import org.apache.spark.mllib.rdd.RDDFunctions._


HTTPS Redirection With Akka HTTP

Akka HTTP is an HTTP toolkit built on top of Akka Stream. Rather than a framework for rapid web server development, it’s principally designed as a suite of tools for building custom integration layers to wire potentially complex business logic with a REST/HTTP interface. Perhaps for that reason, one might be surprised that there isn’t any example code for something as common as running an HTTPS-by-default web server.

Almost every major website operates using the HTTPS protocol by default for security purposes these days. Under the protocol, the required SSL certificate and the bidirectional encryption of the communications between the web server and client ensure the authenticity of the website as well as help avoid man-in-the-middle attacks. It might be overkill for, say, an information-only website, but the ‘lock’ icon indicating a valid SSL certificate in the web browser address bar certainly makes site visitors feel more secure.

In this blog post, I’ll assemble a snippet using Akka HTTP to illustrate how to set up a skeletal web server which redirects all plain-HTTP requests to the HTTPS listener. For testing purposes in a development environment, I’ll also include steps for creating a self-signed SSL certificate. Note that such a self-signed certificate should only be used for internal testing purposes.

HTTP and HTTPS cannot serve on the same port

Intuitively, one might consider binding both HTTP and HTTPS services to the same port, on which all requests are processed by an HTTPS handler. Unfortunately, HTTPS uses the SSL/TLS protocol which, for security reasons, can’t simply be downgraded to HTTP upon detecting unencrypted requests. A straightforward solution is to bind the HTTP and HTTPS services to separate ports and redirect all requests coming into the HTTP port to the HTTPS port.

First, let’s create ‘build.sbt’ with the necessary library dependencies under the project root subdirectory:

Next, create the main application in, say, ${project-root}/src/main/SecureServer.scala:

The top half of the main code consists of initialization routines for the Akka actor system and stream materializer (which are what Akka HTTP is built on top of), and creates the HTTPS connection context. The rest of the code is a standard Akka HTTP snippet with URL routing and server port binding. A good portion of the code is borrowed from Akka’s server-side HTTPS support documentation.

Within the ‘scheme(“http”)’ routing code block is the core logic for HTTPS redirection:
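That redirect logic might be sketched like so; the host name and ports are illustrative, and withAuthority rewrites the target host/port:

```scala
import akka.http.scaladsl.model.StatusCodes
import akka.http.scaladsl.server.Directives._

val route =
  scheme("http") {
    extract(_.request.uri) { uri =>
      // Rewrite the scheme and authority, then permanently redirect.
      redirect(uri.withScheme("https").withAuthority("dev.genuine.com", 8443),
               StatusCodes.MovedPermanently)
    }
  } ~
  path("hello") {
    complete("Hello from the HTTPS listener")
  }
```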

Note that there is no need to apply ‘withAuthority()’ if you’re using the standard HTTPS port (i.e. 443).

The next step is to put in place the PKCS #12 formatted file, ‘server.p12’, which consists of the PKCS private key and the X.509 SSL certificate. It should be placed under ${project-root}/src/main/resources/. At the bottom of this blog post are steps for creating the server key/certificate using the open-source library OpenSSL.

Once the private key/certificate is in place, to run the server application from a Linux command prompt, simply use ‘sbt’ as below:

To test it out from a web browser, visit http://dev.genuine.com:8080/hello and you should see the URL get redirected to https://dev.genuine.com:8443/hello. The web browser will warn about the security of the site, but that’s just because the SSL certificate is a self-signed one.

Generating server key and self-signed SSL certificate in PKCS #12 format
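The following OpenSSL commands sketch one way to do it; the common name, key size, validity period, and password are illustrative and should be replaced with real values for your environment:

```shell
# Generate a private key and a self-signed X.509 certificate (testing only).
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -subj "/CN=dev.genuine.com" \
  -keyout server.key -out server.crt

# Bundle the key and certificate into a PKCS #12 file for the server.
openssl pkcs12 -export -in server.crt -inkey server.key \
  -name genuine -passout pass:changeit -out server.p12
```

The resulting server.p12 then goes under ${project-root}/src/main/resources/.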


Akka Content-based Substreaming

In a previous blog post, a simple text mining application was developed using Akka Streams. In that application, the graph-building create() DSL method was used to build the stream topology that consists of the routing logic of the various stream components.

This time, we’re going to try something a little different. Rather than using the create() graph builder, we’ll directly use some fan-out/fan-in functions to process stream data. In particular, we’ll process a messaging stream and dynamically demultiplex the stream into substreams based on the group the messages belong to.

For illustration purposes, the messaging element in the stream source is modeled as a case class with group and content info. The substreams dynamically split by message group will go into individual file sinks, each of which is wrapped along with the file name in another case class.

Next, we create a map with the message group as key and the corresponding sink (which has the actual file path) as value:

Demultiplexing via Akka Streams groupBy()

We then use Akka Streams’ groupBy() to split the stream by message group into substreams:

Note that after applying groupBy() (see the method signature), the split substreams are processed in parallel for each message group using mapAsync(), which transforms the stream by applying a specified function to each of the elements. Since our goal is to create the individual files, the final mergeSubstreams is just for combining the substreams to be executed with an unused sink.
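A sketch of the demultiplexing flow might look like this; the Message case class, group capacity, parallelism, and file paths are illustrative assumptions, and an implicit ActorSystem/materializer is assumed in scope:

```scala
import java.nio.file.Paths
import akka.stream.scaladsl.{FileIO, Sink, Source}
import akka.util.ByteString

case class Message(group: String, content: String)

val done = Source(messages)                    // messages: Seq[Message]
  .groupBy(maxSubstreams = 64, _.group)        // one substream per message group
  .fold(("", "")) { case ((_, acc), m) => (m.group, acc + m.content + "\n") }
  .mapAsync(parallelism = 4) { case (group, content) =>
    Source.single(ByteString(content))
      .runWith(FileIO.toPath(Paths.get(s"/tmp/$group.txt")))
  }
  .mergeSubstreams                             // combine substreams for execution
  .runWith(Sink.ignore)                        // unused sink
```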

Putting all the above pieces together:

Merging stream sources with flatMapConcat

Conversely, given the split files, we can merge them back into a single stream using flatMapConcat. The word ‘merge’ is being used loosely here, as flatMapConcat (see method signature) actually consumes the individual sources one after another by flattening the source stream elements using concatenation.
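The merge might be sketched like so; file paths are illustrative, and an implicit materializer is assumed in scope:

```scala
import java.nio.file.Paths
import akka.stream.scaladsl.{FileIO, Source}

val merged = Source(List("/tmp/groupA.txt", "/tmp/groupB.txt"))
  .flatMapConcat(path => FileIO.fromPath(Paths.get(path)))  // concatenate one source after another
  .runWith(FileIO.toPath(Paths.get("/tmp/merged.txt")))
```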

In case the files are large, processing them in measured chunks to limit memory usage may be preferred:

A line feed (i.e. “\n”) is used as the delimiter for each frame here, but it can be set to anything else.

2 thoughts on “Akka Content-based Substreaming”

  1. bill (May 21, 2018 at 12:14 pm)

     I high appreciate this post. It’s hard to find the good from the bad sometimes, but I think you’ve nailed it! would you mind updating your blog with more information?

  2. Pingback: Merging Akka Streams With MergeLatest | Genuine Blog


Generic Top N Elements In Scala

Getting the top N elements from a list of elements is a common need in applications that involve data retrieval. If the list is big, it’s inefficient and wasteful (in terms of processing resources) to sort the entire list when one is interested in only the top few elements.

Consider a list of real numbers (i.e. Double- or Float-typed) and let’s say we want to fetch the smallest N numbers from the list. A commonly used algorithm for the task is rather straightforward:

Start with the first N numbers of the list as the selected N-element sublist; then, for each of the remaining numbers of the list, if it’s smaller than the largest number in the sublist, swap out the largest number in that iteration.

Algorithmic steps

Formulating that as programmatic steps, we have:

  1. Maintain a sorted N-element list in descending order, hence its head is the max of the list (assuming N isn’t a big number, the cost of sorting is trivial).
  2. In each iteration, if the current element in the original list is smaller than the head element of the N-element list, replace the head element with the current element; otherwise leave the current N-element list unchanged.

Upon completing the iterations, the N-element list will consist of the smallest elements of the original list, and a final sort in ascending order will produce the sorted N-element list:
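A minimal implementation of the steps above might look like this:

```scala
// Fetch the smallest n numbers; assumes n <= list.size.
def topN(n: Int, list: List[Double]): List[Double] = {
  val (first, rest) = list.splitAt(n)
  rest.foldLeft(first.sortWith(_ > _)) { (l, e) =>
    // Head of l is the current max; swap it out if e is smaller.
    if (e < l.head) (e :: l.tail).sortWith(_ > _) else l
  }.sortWith(_ < _)  // final ascending sort
}
```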

Note that it would be trivial to modify the method to fetch the largest N numbers (instead of the smallest), in which case one only needs to reverse the inequality operator in ‘e < l.head’, the iterative ‘sortWith(_ > _)’ and the final ‘sortWith(_ < _)’.

Refactoring to eliminate sorting

Now, let’s say we’re going to use it to fetch some top N elements where N is a little bigger, like the top 5,000 from a 1-million-element list. Except for the inevitable final sorting of the sublist, all the other ‘sortWith()’ operations can be replaced with something less expensive. Since all we care about is being able to conditionally swap out the largest number in the sublist, we just need the largest number to be placed at the head of the sublist, and the same algorithmic flow will work fine.

The refactored topN method below replaces all ‘sortWith()’ calls (except for the final sorting) with ‘bigHead()’, which places the largest number of the input list at its head position:
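A sketch of the refactored version, where bigHead only moves the max to the head rather than fully sorting:

```scala
// Place the largest number at the head position (no full sort).
def bigHead(l: List[Double]): List[Double] = {
  val m = l.max
  m :: l.diff(List(m))  // diff removes one occurrence of m
}

// Same flow as before, with bigHead replacing the iterative sorts.
def topN(n: Int, list: List[Double]): List[Double] = {
  val (first, rest) = list.splitAt(n)
  rest.foldLeft(bigHead(first)) { (l, e) =>
    if (e < l.head) bigHead(e :: l.tail) else l
  }.sortWith(_ < _)  // the one remaining (final) sort
}
```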

Generic top N numeric elements

Next, we generalize method topN to handle any list of elements of a type implicitly associated with Numeric, using a typeclass pattern known as a context bound.

With the context bound for type ‘T’, importing the implicit mkNumericOps and mkOrderingOps methods makes arithmetic and comparison operators available for the list elements to be compared and ordered.

Generic top N objects ordered by mapped values

To further generalize topN, rather than being limited to numeric elements, we enable it to take a list of generic objects and return the top N elements ordered by each element’s corresponding value (e.g. an orderable class field of the object). To accomplish that, we revise topN as follows:

  • Loosen the context bound from ‘Numeric’ to the more generic Ordering so that items can be ordered by non-numeric values such as strings
  • Take as an additional parameter a mapping function that tells which value corresponding to each object the ordering should be based on

Note that the type parameter ‘T : Ordering’, which signifies a context bound, is the shorthand notation for:
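That is, `def topN[T : Ordering](…)` desugars to `def topN[T](…)(implicit ev: Ordering[T])`. A sketch of the fully generalized method, with an illustrative name and a mapping function parameter:

```scala
import Ordering.Implicits._  // comparison operators for the mapped type S

// Top n elements of `list`, ordered by the mapped value f(element).
def topNBy[T, S : Ordering](n: Int, list: List[T])(f: T => S): List[T] = {
  def bigHead(l: List[T]): List[T] = {
    val m = l.maxBy(f)
    m :: l.diff(List(m))
  }
  val (first, rest) = list.splitAt(n)
  rest.foldLeft(bigHead(first)) { (l, e) =>
    if (f(e) < f(l.head)) bigHead(e :: l.tail) else l
  }.sortBy(f)
}
```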


Patching Numeric Sequence In Scala

Like fetching the top N elements from a sequence of comparable elements, patching a numeric sequence is also a common need, especially when processing data that isn’t complete or clean. By “patching”, I mean interpolating missing spots in a list of numbers. A simplistic patch or interpolation is to fill a missing number with the average of the previous few numbers.

For example, given the following list of numbers:

60, 10, 50, (), 20, 90, 40, 80, (), (), 70, 30

we would like to replace each of the missing numbers with the average of, say, its previous 3 numbers. In this case, the leftmost missing number should be replaced with 40 (i.e. (60 + 10 + 50) / 3).

Below is a simple snippet that patches missing numbers in a Double-typed sequence with the average of the previous N numbers. The missing (or bad) numbers in the original sequence are represented as Double.NaN.

As shown in the code, method ‘patchCurrElem’ is created to prepend either the current element or, when it’s missing, the calculated average of the previous N numbers to the supplied list. Its signature fits well as a function taken by ‘foldLeft’ to traverse the entire sequence and apply the patch. Since ‘patchCurrElem’ prepends to the sub-sequence for optimal List operations in Scala, the final list requires a reversal.

Note that ‘lastN.size’, rather than the literal ‘N’, is used to handle cases where fewer than N prior numbers are available for the average calculation. And ‘case Nil’ covers the case where there is no prior number.
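The snippet described might look like the following sketch; the 0.0 fallback when there is no prior number is an assumption:

```scala
// Patch NaN entries with the average of up to n previous (patched) numbers.
def patchAvgLastN(seq: Seq[Double], n: Int): List[Double] = {
  def patchCurrElem(accList: List[Double], e: Double): List[Double] = {
    val patched =
      if (!e.isNaN) e
      else accList match {
        case Nil => 0.0                 // no prior number available (assumed fallback)
        case l =>
          val lastN = l.take(n)         // head of accList is the most recent element
          lastN.sum / lastN.size        // lastN.size handles fewer than n priors
      }
    patched :: accList                  // prepend for efficient List operations
  }
  seq.foldLeft(List.empty[Double])(patchCurrElem).reverse
}
```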

Generalizing the patch method

In deriving a generic version of ‘patchAvgLastN’, we’re not going to generalize it for Scala Numeric, as ‘average’ isn’t quite meaningful for non-fractional numbers such as integers. Instead, we’ll generalize it for Scala Fractional, which provides method ‘mkNumericOps’ for access to FractionalOps, which includes the division operator (i.e. ‘/’) necessary for the average calculation.

Since we’re no longer handling only Double-typed numbers, ‘-999’ (of type Int) is used as the default value to replace Double.NaN as the marker for a missing (or bad) number.
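A generalized sketch for Fractional types; here the missing-value marker is passed in as a parameter, and the zero fallback for an empty prefix is an assumption:

```scala
// Patch `missing`-marked entries in a Fractional sequence with the
// average of up to n previous (patched) numbers.
def patchAvgLastN[T : Fractional](seq: Seq[T], n: Int, missing: T): List[T] = {
  val frac = implicitly[Fractional[T]]
  import frac._  // mkNumericOps brings in the '/' operator via FractionalOps

  def patchCurrElem(accList: List[T], e: T): List[T] = {
    val patched =
      if (e != missing) e
      else accList match {
        case Nil => zero
        case l =>
          val lastN = l.take(n)
          lastN.sum / fromInt(lastN.size)
      }
    patched :: accList
  }
  seq.foldLeft(List.empty[T])(patchCurrElem).reverse
}
```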

Work-arounds for patching integer sequences

A quick work-around for interpolating a list of integers (type Int, Long or BigInt) is to transform the integers to a Fractional type, apply the patch and transform back to the original type. Note that in some cases, rounding or truncation might occur in the transformations. For example:
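Using the generic patchAvgLastN on lifted values, the round trip might look like this sketch (the -999 marker is illustrative):

```scala
// Lift Ints to Double, patch, then convert back (rounding may occur).
val ints = List(60, 10, 50, -999, 20)
val patched: List[Int] =
  patchAvgLastN(ints.map(_.toDouble), 3, -999.0)
    .map(d => math.round(d).toInt)
```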


A Brief Overview Of Scala Futures

As demand for computing performance continues to grow, contemporary applications have been increasingly exploiting the collective processing power of all the available CPU cores to maximize parallel task execution. But writing asynchronous code requires methodical processing control to avoid issues such as race conditions, and can be quite challenging even for experienced programmers.

The Scala Future API provides a comprehensive set of functions for writing concurrent, asynchronous code. By design, Scala is a functional programming language with an inherent emphasis on immutability and composability, which helps avoid issues like race conditions and facilitates successive transformations. In this blog post, we’re going to explore how Scala Futures benefit from those functional features.

Simulating a CPU-bound task

Let’s first prepare a couple of items for upcoming use:

  1. a Result case class representing the result of a piece of work, with a work id and the time spent in milliseconds
  2. a doWork method which mimics executing some CPU-bound work for a random period of time and returns a Result object

Note that the side-effecting println within doWork is there to illustrate when each of the asynchronously launched tasks is executed.
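These two items might be sketched as follows; the duration range and message format are illustrative assumptions:

```scala
import scala.util.Random

case class Result(workId: Int, timeSpentMs: Long)

// Mimic some CPU-bound work taking a random period of time.
def doWork(workId: Int): Result = {
  val start = System.currentTimeMillis
  Thread.sleep(100 + Random.nextInt(400))  // stand-in for real computation
  println(s"Work $workId done")            // shows when each task runs
  Result(workId, System.currentTimeMillis - start)
}
```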

The conventional way of running asynchronous tasks
运行异步任务的传统方式

Using the doWork method, the conventional way of asynchronously running a number of tasks typically involves a configurable thread pool using Java Executor to execute the tasks as individual Runnables.
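A sketch of that conventional approach (pool size and task count are arbitrary choices here; doWork and Result are the sketches from earlier):

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.util.Random

case class Result[T](workId: T, timeMs: Long)
def doWork(id: Int): Result[Int] = {
  val ms = 100 + Random.nextInt(200)
  val end = System.currentTimeMillis + ms
  while (System.currentTimeMillis < end) {}  // busy-spin: CPU-bound
  println(s"Work $id took $ms ms"); Result(id, ms)
}

val threadPool = Executors.newFixedThreadPool(4)  // configurable thread pool

(1 to 4).foreach { id =>
  threadPool.execute(new Runnable {
    def run(): Unit = doWork(id)  // note: run() does not return a value
  })
}

threadPool.shutdown()
threadPool.awaitTermination(10, TimeUnit.SECONDS)
```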

Despite the (1 to 4) ordering of the tasks, the chronological work-result printouts with shuffled work ids show that the tasks were processed in parallel. It's worth noting that method run() does not return a value.

Using Scala Futures

By simply wrapping doWork in a Future, each task is now asynchronously executed and results are captured by the onComplete callback method. The callback method takes a Try[T] => U function and can be expanded to handle the success/failure cases accordingly:
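A sketch of this, reusing the earlier doWork definition:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Random, Success}

case class Result[T](workId: T, timeMs: Long)
def doWork(id: Int): Result[Int] = {
  val ms = 100 + Random.nextInt(200)
  val end = System.currentTimeMillis + ms
  while (System.currentTimeMillis < end) {}
  println(s"Work $id took $ms ms"); Result(id, ms)
}

// An ExecutionContext backed by the same kind of Executor thread pool
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))

(1 to 4).foreach { id =>
  Future(doWork(id)).onComplete {
    case Success(res) => println(s"Succeeded with $res")
    case Failure(e)   => println(s"Failed with $e")
  }
}
```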

We're using the same Executor thread pool, which can be configured to optimize for a specific computing environment (e.g. number of CPU cores). The implicit ExecutionContext is required for executing callback methods such as onComplete. One could also fall back to Scala's default ExecutionContext, which is a Fork/Join Executor, by simply importing the following:
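That fallback is just a one-line import:

```scala
import scala.concurrent.ExecutionContext.Implicits.global
```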

However, Scala Futures provide a lot more than just a handy wrapper for executing non-blocking tasks with callback methods.

Immutability and composability

Scala Futures involve code running in multiple threads. By adhering to Scala's immutable collections, defining values as immutable vals (as opposed to mutable vars), and relying on functional transformations (as opposed to mutations), one can easily write concurrent code that is thread-safe, avoiding problems such as race conditions.

But perhaps one of the most sought-after features of Scala Futures is the support for composable transformations in the functional programming way. For example, we can chain a bunch of Futures via methods like map and filter:
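A sketch consistent with the description that follows (the 1400 ms threshold comes from the text; doWork is the earlier sketch):

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.util.Random

case class Result[T](workId: T, timeMs: Long)
def doWork(id: Int): Result[Int] = {
  val ms = 100 + Random.nextInt(200)
  val end = System.currentTimeMillis + ms
  while (System.currentTimeMillis < end) {}
  Result(id, ms)
}

val chained: Future[Result[Int]] =
  Future(doWork(1)).
    filter(res => res.timeMs < 1400).    // proceed only if the first task finished quickly enough
    flatMap(_ => Future(doWork(2)))      // then kick off the next task
```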

The above snippet asynchronously runs doWork(1), and if finished within 1400 ms, continues to run the next task doWork(2).

Another example: let's say we have a number of predefined methods doChainedWork(id, res) with id = 1, 2, 3, …, each taking a work result and deriving a new work result like below:
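The original definitions aren't shown; a minimal sketch of one such method (the derivation logic is an assumption):

```scala
import scala.util.Random

case class Result[T](workId: T, timeMs: Long)

// Each doChainedWork takes a previous Result and derives a new one
def doChainedWork(id: Int, res: Result[Int]): Result[Int] = {
  val ms = 100 + Random.nextInt(200)
  val end = System.currentTimeMillis + ms
  while (System.currentTimeMillis < end) {}
  println(s"Chained work $id derived from work ${res.workId}")
  Result(id, res.timeMs + ms)  // accumulate time spent as the derived result
}
```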

And let's say we want to successively apply doChainedWork in a non-blocking fashion. We can simply wrap each of them in a Future and chain them using flatMap:
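A sketch of that chaining, reusing the doChainedWork sketch:

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.util.Random

case class Result[T](workId: T, timeMs: Long)
def doChainedWork(id: Int, res: Result[Int]): Result[Int] = {
  val ms = 100 + Random.nextInt(200)
  val end = System.currentTimeMillis + ms
  while (System.currentTimeMillis < end) {}
  Result(id, res.timeMs + ms)
}

// Each step runs only after the previous Future has completed successfully
val chainedResult: Future[Result[Int]] =
  Future(doChainedWork(1, Result(0, 0L))).
    flatMap(res1 => Future(doChainedWork(2, res1))).
    flatMap(res2 => Future(doChainedWork(3, res2)))
```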

Note that neither of the above trivialized examples handles failures, hence each will break upon the first exception. Depending on the specific business logic, that might not be desirable.

Using map and recover on Futures

While the onComplete callback in the previous example can handle failures, its Unit return type hinders composability. This section addresses that very issue.

In the Future trait, methods map and recover have the following signatures:
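From the Scala standard library (Scala 2.12-era signatures):

```scala
def map[S](f: T => S)(implicit executor: ExecutionContext): Future[S]

def recover[U >: T](pf: PartialFunction[Throwable, U])(implicit executor: ExecutionContext): Future[U]
```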

When a Future results in success, method map applies the provided function to the result. On the other hand, if the Future results in failure with a Throwable, method recover applies a given partial function that matches the Throwable. Both methods return a new Future.
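A sketch of combining map and recover into an Either-typed result (doWork is the earlier sketch):

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.util.Random

case class Result[T](workId: T, timeMs: Long)
def doWork(id: Int): Result[Int] = {
  val ms = 100 + Random.nextInt(200)
  val end = System.currentTimeMillis + ms
  while (System.currentTimeMillis < end) {}
  Result(id, ms)
}

// Success becomes Right, failure becomes Left, so downstream
// transformations can keep composing in a type-safe fashion
def futureWork(id: Int): Future[Either[Throwable, Result[Int]]] =
  Future(doWork(id)).
    map(res => (Right(res): Either[Throwable, Result[Int]])).
    recover { case e => Left(e) }
```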

In the above sample code, we use map to create a Future of Right[Result] when doWork succeeds, and recover to create a Future of Left[Throwable] when doWork fails. Using Either[Throwable, Result[Int]] as the return data type, we capture successful and failed returns in a type-safe fashion, allowing composition of any additional transformations.

Method Await.result is used to wait for a given duration for the combined Future to complete and return the result.

From a sequence of Futures to a Future of sequence

Oftentimes, when faced with a set of Futures each of which consists of values to be consumed, we would prefer wrapping the set of values within a single Future. For that purpose, the Scala Future companion object provides a useful method Future.sequence which converts a sequence of Futures to a Future of sequence.
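A sketch of Future.sequence in action, reusing the map/recover futureWork sketch:

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.Random

case class Result[T](workId: T, timeMs: Long)
def doWork(id: Int): Result[Int] = {
  val ms = 100 + Random.nextInt(200)
  val end = System.currentTimeMillis + ms
  while (System.currentTimeMillis < end) {}
  Result(id, ms)
}
def futureWork(id: Int): Future[Either[Throwable, Result[Int]]] =
  Future(doWork(id)).
    map(res => (Right(res): Either[Throwable, Result[Int]])).
    recover { case e => Left(e) }

// List[Future[...]] becomes Future[List[...]]
val futureOfList: Future[List[Either[Throwable, Result[Int]]]] =
  Future.sequence((1 to 4).toList.map(futureWork))

Await.result(futureOfList, 30.seconds).foreach(println)
```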

As shown in the example, a collection of Future[Either[Throwable,Result]] is transformed into a single Future of Either[Throwable,Result] elements.

Or, we could use method Future.traverse, which is a more generalized version of Future.sequence. It allows one to provide a function, f: A => Future[B], as an additional input to be applied to each item of the input sequence. The following snippet, which takes a (1 to 4) range and an Int => Future[Either[Throwable,Result]] function as input, carries out the same transformation as the above Future.sequence snippet.
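A sketch of the equivalent Future.traverse version:

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.Random

case class Result[T](workId: T, timeMs: Long)
def doWork(id: Int): Result[Int] = {
  val ms = 100 + Random.nextInt(200)
  val end = System.currentTimeMillis + ms
  while (System.currentTimeMillis < end) {}
  Result(id, ms)
}
def futureWork(id: Int): Future[Either[Throwable, Result[Int]]] =
  Future(doWork(id)).
    map(res => (Right(res): Either[Throwable, Result[Int]])).
    recover { case e => Left(e) }

// traverse = map each item to a Future, then sequence, in one step
val traversed: Future[List[Either[Throwable, Result[Int]]]] =
  Future.traverse((1 to 4).toList)(id => futureWork(id))
```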

First completed Future

If one only cares about the first completed Future out of a list of Futures launched in parallel, the Scala Future companion object provides a handy method Future.firstCompletedOf.
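A minimal usage sketch:

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.util.Random

case class Result[T](workId: T, timeMs: Long)
def doWork(id: Int): Result[Int] = {
  val ms = 100 + Random.nextInt(200)
  val end = System.currentTimeMillis + ms
  while (System.currentTimeMillis < end) {}
  Result(id, ms)
}

// Completes as soon as any one of the parallel tasks completes
val first: Future[Result[Int]] =
  Future.firstCompletedOf((1 to 4).map(id => Future(doWork(id))))
```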

Grasping the "future"

As the code examples have demonstrated so far, a Scala Future is an abstraction which returns a value in the future. A take-away point is that once a Future is "kicked off", it's in some sense "untouchable" and its value won't be available till it's completed with success or failure. In another blog post, we'll explore how one can seize a little more control over the "when" and "what" in completing a Future.


Scala Promises – Futures In Your Hands

In the previous blog post, we saw how Scala Futures serve as a handy wrapper for running asynchronous tasks and allow non-blocking functional transformations via composable functions. Despite all the goodies, a plain Future, once started, is read-only.

A "manipulable" Future

To make things a little more interesting, let's take a glimpse into an interesting "container" that holds an "uncertain" Future. Scala provides another abstraction called Promise that allows programmers to have some control over the "when" and "what" in completing a Future. A Promise is like a container holding a Future which can be completed by assigning a value (with success or failure) at any point of time. The catch is that it can only be completed once.

The Promise companion object has the following apply method that creates a DefaultPromise:
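From the Scala 2.12 standard library source (approximately):

```scala
object Promise {
  /** Creates a promise object which can be completed with a value. */
  def apply[T](): Promise[T] = new impl.Promise.DefaultPromise[T]()
}
```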

As shown below, the DefaultPromise class extends AtomicReference to ensure that a Promise instance will be completed in an atomic fashion.

A trivial producer and consumer

A common use case of Promise goes like this:

  1. a Promise which holds an "open" Future is created
  2. run some business logic to come up with some value
  3. complete the Promise by assigning its Future the value via methods like success(), failure(), tryComplete(), etc.
  4. return the "closed" Future

Here's a hello-world example of Scala Promise used in a trivialized producer and consumer:
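A sketch consistent with the description that follows (the even/odd rule is an assumption):

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Future, Promise}
import scala.util.{Failure, Random, Success}

val promise = Promise[Int]()

// Producer: completes the Promise's future exactly once,
// with success or failure depending on a random integer
val producer = Future {
  val num = Random.nextInt(100)
  if (num % 2 == 0) promise.success(num)
  else promise.failure(new Exception(s"Odd number $num rejected!"))
}

// Consumer: checks and reports the value of the completed future
promise.future.onComplete {
  case Success(n) => println(s"Consumer got even number $n")
  case Failure(e) => println(s"Consumer got failure: ${e.getMessage}")
}
```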

The above code snippet is rather self-explanatory. The producer running in one thread completes the Promise's future based on the result of a randomly generated integer, and the consumer in another thread checks and reports the value of the completed future.

Simulating a CPU-bound task, again

For upcoming illustrations, let's borrow the CPU-bound task simulation doWork method used in the coding examples from the previous blog post:

Revisiting first completed Future

Recall that method Future.firstCompletedOf from the previous blog post can be used to capture the first completed Future out of a list of Futures running in parallel:

Now, let's see how firstCompletedOf is actually implemented in Scala Future using Promise:
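Here's the implementation, lightly adapted from the Scala 2.12 standard library source:

```scala
import java.util.concurrent.atomic.AtomicReference
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.util.Try

def firstCompletedOf[T](futures: TraversableOnce[Future[T]])
    (implicit executor: ExecutionContext): Future[T] = {
  val p = Promise[T]()
  // An atomic one-shot handler: whichever Future completes first wins the race
  val firstCompleteHandler = new AtomicReference[Promise[T]](p) with (Try[T] => Unit) {
    override def apply(v1: Try[T]): Unit = getAndSet(null) match {
      case null => () // the Promise was already completed by an earlier Future
      case some => some.tryComplete(v1)
    }
  }
  futures.foreach(_ onComplete firstCompleteHandler)
  p.future
}
```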

In the firstCompletedOf method implementation, the helper callback function firstCompleteHandler for each of the Futures in the input list ensures by means of an AtomicReference that the first completed Future will be the Promise's future.

First completed Future with a condition

What if we want to get the first completed Future from a number of Futures whose values meet a certain condition? One approach would be to derive from the firstCompletedOf method implementation.

We pick the default ExecutionContext as we did in some coding examples from the previous blog post. Besides the list of Futures, the derived method firstConditionallyCompletedOf[T] also takes a T => Boolean filtering condition as a parameter. Piggybacking on the core logic of method firstCompletedOf, we simply apply the input filter to each of the Futures in the input list before the callback.
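A sketch of such a derived method (note that, as sketched, the returned Future never completes if no value ever satisfies the condition):

```scala
import java.util.concurrent.atomic.AtomicReference
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Future, Promise}
import scala.util.{Success, Try}

def firstConditionallyCompletedOf[T](futures: List[Future[T]])(cond: T => Boolean): Future[T] = {
  val p = Promise[T]()
  val handler = new AtomicReference[Promise[T]](p) with (Try[T] => Unit) {
    override def apply(v1: Try[T]): Unit = getAndSet(null) match {
      case null => ()
      case some => some.tryComplete(v1)
    }
  }
  // Apply the filtering condition before racing to complete the Promise;
  // Future#foreach fires only on success, so non-matching values are skipped
  futures.foreach(_.filter(cond).foreach(v => handler(Success(v))))
  p.future
}
```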

First N completed Futures

While at it, rather than just the first completed Future, what if we want to capture the first few completed Futures? Deriving from the firstCompletedOf implementation wouldn't quite work – the way the helper callback function firstCompleteHandler is structured wouldn't be useful now that we have a list of Futures to be captured.

We'll take a straightforward approach of using a var list for capturing the first N (or the size of the input Futures, whichever is smaller) Future results and updating the list inside a synchronized block. Since we want to capture the first few completed Futures (success or failure), we make the returned Future consist of a List[Either[Throwable, T]], rather than just a List[T].
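A sketch of that approach:

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Future, Promise}
import scala.util.{Failure, Success}

def firstNCompletedOf[T](futures: List[Future[T]], n: Int): Future[List[Either[Throwable, T]]] = {
  val p = Promise[List[Either[Throwable, T]]]()
  val target = math.min(n, futures.size)
  val lock = new Object
  var completed = List.empty[Either[Throwable, T]]  // guarded by `lock`
  futures.foreach(_.onComplete { result =>
    lock.synchronized {
      if (completed.size < target) {
        completed :+= (result match {
          case Success(v) => Right(v)
          case Failure(e) => Left(e)
        })
        // Complete the Promise once the first N results (success or failure) are in
        if (completed.size == target) p.trySuccess(completed)
      }
    }
  })
  p.future
}
```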

Simulating a non-CPU-bound task

Rather than keeping the CPU busy (thus CPU-bound), a non-CPU-bound asynchronous task does not demand extensive processing resources. The following snippet defines a method that mimics a non-CPU-bound asynchronous task, which could be, say, a non-blocking call to a remote database. This time, we're going to run on an Akka Actor system, using the ExecutionContext that comes with its default dispatcher. Besides the Fork/Join Executor provided by the dispatcher, we pick the Akka runtime library also to leverage its high-throughput scheduler.
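A sketch of such a method, assuming the Akka actor library is on the classpath (the delay range and the success/failure threshold are assumptions):

```scala
import scala.concurrent.{Future, Promise}
import scala.concurrent.duration._
import scala.util.Random
import akka.actor.ActorSystem

case class Result[T](workId: T, timeMs: Long)

val system = ActorSystem("futures")
implicit val ec = system.dispatcher  // default-dispatcher ExecutionContext

// Mimic a non-CPU-bound task (e.g. a non-blocking remote call):
// the scheduler completes the Promise after a random delay,
// with no busy CPU work in between
def nonCPUbound(id: Int): Future[Result[Int]] = {
  val p = Promise[Result[Int]]()
  val delayMs = 500 + Random.nextInt(1500)
  system.scheduler.scheduleOnce(delayMs.millis) {
    if (delayMs < 1800) p.success(Result(id, delayMs))
    else p.failure(new Exception(s"Work $id timed out after $delayMs ms!"))
  }
  p.future
}
```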

In this example, a Promise which contains a Future is created, and after a random duration the scheduler triggers the completion of the Future with success or failure depending on the random time.

Launching method nonCPUbound() with some value a few times would yield results similar to the following:

CPU-bound versus non-CPU-bound tasks

By wrapping a CPU-bound task like doWork() with a Future, the task becomes non-blocking but it still consumes processing power. The default ExecutionContext via the scala.concurrent.ExecutionContext.Implicits.global import will optimally set scala.concurrent.context.maxThreads to the number of CPU cores of the machine the application resides on. One can raise maxThreads and handcraft a custom ExecutionContext.fromExecutor(Executors.newFixedThreadPool(numOfThreads)) to allow more threads to run. To set the value of maxThreads to, say, 16, simply add the following javaOptions to build.sbt.
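Something along these lines in build.sbt (the fork setting is assumed, since javaOptions only apply to forked JVMs):

```scala
fork := true
javaOptions += "-Dscala.concurrent.context.maxThreads=16"
```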

However, that wouldn't necessarily make more instances of Future{ doWork() } than the number of CPU cores execute in parallel, since each of them consumes CPU resources while executing.

On the other hand, a non-CPU-bound task like nonCPUbound() takes little CPU resource. In this case, configuring an ExecutionContext with more threads than the CPU cores of the local machine can increase performance, since none of the individual tasks would consume anywhere near the full capacity of a CPU core. It's not uncommon to configure a pool of hundreds of threads to handle a large amount of such tasks on a machine with just a handful of CPU cores.

Futures or Promises?

While the Scala Future API extensively utilizes Promises in its function implementations, we don't need to explicitly use Promises very often as the Futures API already delivers a suite of common concurrent features for writing asynchronous code. If the business logic doesn't need Promises, just stick to the Futures API. But for cases in which you need to provide a "contract to be fulfilled at most once in the future", say, between two modules like the producer/consumer example above, Promises do come in handy.


Spark – Interpolating Time Series Data

Like many software professionals, I often resort to Stack Overflow (through search engines) when looking for clues to solve programming problems at hand. Besides looking for programming solution ideas, when I find some free time I would also visit the site to help answer questions posted there. Occasionally, some of the more interesting question topics would trigger me to expand them into blog posts. In fact, a good chunk of the content in my Scala-on-Spark (SoS) mini blog series was from selected Stack Overflow questions I've answered in the past. So is the topic being tackled in this blog post.

Suppose we have time-series data with time gaps among the chronological timestamps like below:

Our goal is to expand column timestamp into per-minute timestamps and column amount into linearly interpolated values like below:

There are different ways to solve interpolation problems. Since timestamps can be represented as Long values (i.e. Unix time), it might make sense to consider using method spark.range to create a time series of contiguous timestamps and left-join it with the dataset at hand. The catch, though, is that the method applies to the entire dataset (as opposed to per-group) and requires the start and end of the timestamp range as parameters, which might not be known in advance.

A more flexible approach would be to use a UDF (user-defined function) for custom data manipulation, though at the expense of potential performance degradation (since built-in Spark APIs that leverage Spark's optimization engine generally scale better than UDFs). For more details about native functions versus UDFs in Spark, check out this blog post.

Nevertheless, the solution being proposed here involves using a UDF which, for each row, takes the values of timestamp and amount in both the current row and the previous row as parameters, and returns a list of interpolated (timestamp, amount) Tuples. Using the java.time API, the previous and current String-type timestamps will be converted into a LocalDateTime range to be linearly interpolated.
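The original UDF isn't shown; a rough sketch of the idea, assuming Spark is on the classpath (the function name, timestamp pattern handling, and per-minute step logic are assumptions):

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.udf

// For each row: produce per-minute (timestamp, amount) pairs from the
// previous row (exclusive of the current one), linearly interpolated
def interpolateUDF(pattern: String) = udf {
  (tsPrev: String, tsCurr: String, amtPrev: Double, amtCurr: Double) =>
    val fmt = DateTimeFormatter.ofPattern(pattern)
    val (t1, t2) = (LocalDateTime.parse(tsPrev, fmt), LocalDateTime.parse(tsCurr, fmt))
    val minutes = Duration.between(t1, t2).toMinutes
    // A functional while-loop: generate successive per-minute steps
    Iterator.iterate((t1, amtPrev)) { case (t, a) =>
      (t.plusMinutes(1), a + (amtCurr - amtPrev) / minutes)  // linear step
    }.
    takeWhile { case (t, _) => t.isBefore(t2) }.
    map { case (t, a) => (t.format(fmt), a) }.
    toList
}
```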

Note that Iterator.iterate(init)(next).takeWhile(condition) in the UDF is just a functional version of the conventional while-loop.

With the UDF in place, we provide the function the timestamp pattern along with the previous/current timestamp pair and previous/current amount pair to produce a list of interpolated timestamp-amount pairs. The output will then be flattened using Spark's built-in explode function.


Spark – Time Series Sessions

When analyzing time series activity data, it's often useful to group the chronological activities into "target"-based sessions. These "targets" could be products, events, web pages, etc.

Using a simplified time series log of web page activities, we'll look at how web page-based sessions can be created in this blog post.

Let's say we have a log of chronological web page activities as shown below:

And let's say we want to group the log data by web page to generate user-defined sessions with format userID-#, where # is a monotonically increasing integer, like below:

The first thing that pops up in one's mind might be to perform a groupBy(user, page) or a Window partitionBy(user, page). But that wouldn't work since doing so would disregard time gaps between the same page, resulting in all rows with the same page grouped together under a given user.

First things first, let's assemble a DataFrame with some sample web page activity data:
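The original sample data isn't shown; a hypothetical stand-in with the same shape (users, pages, timestamps with gaps) might look like:

```scala
import spark.implicits._

// Hypothetical sample data: (user, page, timestamp)
val df = Seq(
  ("101", "home",    "2018-06-01 10:00:00"),
  ("101", "product", "2018-06-01 10:05:00"),
  ("101", "product", "2018-06-01 10:10:00"),
  ("101", "home",    "2018-06-01 11:30:00"),
  ("101", "home",    "2018-06-01 11:35:00"),
  ("102", "home",    "2018-06-01 10:01:00")
).toDF("user", "page", "timestamp")
```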

The solution to be presented here involves a few steps:

  1. Generate a new column first_ts which, for each user, has the value of timestamp in the current row if the page value is different from that in the previous row; otherwise null.
  2. Backfill all the nulls in first_ts with the last non-null value via Window function last() and store it in the new column sess_ts.
  3. Assemble session IDs by concatenating user and the dense_rank of sess_ts within each user partition.
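The steps above can be sketched as follows, assuming a DataFrame df with columns user, page, and timestamp (exact column expressions are assumptions):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val winUser     = Window.partitionBy($"user").orderBy($"timestamp")
val winUserSess = Window.partitionBy($"user").orderBy($"sess_ts")

val sessions = df.
  // Step 1: timestamp where a new page visit starts, else null
  withColumn("first_ts",
    when(row_number.over(winUser) === 1 || $"page" =!= lag($"page", 1).over(winUser),
      $"timestamp")).
  // Step 2: backfill nulls with the last non-null value
  withColumn("sess_ts", last($"first_ts", ignoreNulls = true).over(winUser)).
  // Step 3: session id = user + dense rank of sess_ts within the user partition
  withColumn("session_id", concat($"user", lit("-"), dense_rank.over(winUserSess)))
```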

Note that the final output includes all the intermediate columns (i.e. first_ts and sess_ts) for demonstration purpose.

3 thoughts on "Spark – Time Series Sessions"

  1. deeksha, June 3, 2020 at 10:30 am

     I think 6th line should be

     val winUserSess = Window.partitionBy($"user").orderBy("timestamp") instead of

     val winUserSess = Window.partitionBy($"user").orderBy("sess_ts")

     1. Leo Cheung (post author), June 3, 2020 at 2:14 pm

        Thanks for the comment, deeksha. We want the generated session_id values to correspond to the sess_ts values such that rows for a given user with the same sess_ts have the same session_id. Ordering by sess_ts in the 2nd Window spec is specifically for Window function dense_rank to generate the session_id as ${user}-${rank} to fulfill the requirement, whereas ordering by timestamp would generate distinct ranks for the same sess_ts.


Spark – Custom Timeout Sessions

In the previous blog post, we saw how one could partition a time series log of web activities into web page-based sessions. Operating on the same original dataset, we're going to generate sessions based on a different set of rules.

Rather than web page-based, sessions are defined with the following rules:

  1. A session expires after inactivity of a timeout period (say tmo1), and,
  2. An active session expires after a timeout period (say tmo2).

First, we assemble the original sample dataset used in the previous blog:

Let's set the first timeout tmo1 to 15 minutes, and the second timeout tmo2 to 60 minutes.

The end result should look something like below:

Given the above session creation rules, it's obvious that all programming logic is going to be centered around the timestamp alone, hence the omission of columns like page in the expected final result.

Generating sessions based on rule #1 is rather straightforward, as computing the timestamp difference between consecutive rows is easy with Spark's built-in Window functions. As for session creation rule #2, it requires dynamically identifying the start of the next session, which depends on where the current session ends. Hence, even robust Window functions over, say, partitionBy(user).orderBy(timestamp).rangeBetween(0, tmo2) wouldn't cut it.

The solution to be suggested involves using a UDF (user-defined function) to leverage Scala's feature-rich set of functions:
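The original UDF isn't shown; a rough sketch of its shape, matching the accumulator description that follows (the name sessionizeUDF and the exact expiry arithmetic are assumptions):

```scala
import org.apache.spark.sql.functions.udf

// tsList: raw timestamps, passed through so each can be paired with a session id
// tsDiffs: per-row gaps (seconds) since the previous row, 0 marking a rule-#1 session start
def sessionizeUDF(tmo2: Long) = udf { (tsList: Seq[String], tsDiffs: Seq[Long], user: String) =>
  val (ids, _, _) = tsDiffs.foldLeft((List.empty[String], 0L, 1)) {
    case ((ls, j, k), diff) =>
      val elapsed = if (diff == 0) 0L else j + diff  // j carries time since session start
      if (elapsed > tmo2)                            // rule #2: active session expires after tmo2
        (ls :+ s"$user-${k + 1}", 0L, k + 1)         // k carries the session id number
      else
        (ls :+ s"$user-$k", elapsed, if (diff == 0 && ls.nonEmpty) k + 1 else k)
  }
  tsList.zip(ids)
}
```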

Note that the timestamp diff list tsDiffs is the main input being processed for generating sessions based on the tmo2 value (session creation rule #2). The timestamp list tsList is being "passed thru" merely to be included in the output, with each timestamp paired with the corresponding session ID.

Also note that the accumulator for foldLeft in the UDF is a Tuple of (ls, j, k), where:

  • ls is the list of formatted session IDs to be returned
  • j and k are for carrying over the conditionally changing timestamp value and session id number, respectively, to the next iteration

Now, let's lay out the steps for carrying out the necessary transformations to generate the sessions:

  1. Identify sessions (with 0 = start of a session) per user based on session creation rule #1
  2. Group the dataset to assemble the timestamp diff list per user
  3. Process the timestamp diff list via the above UDF to identify sessions based on rule #2 and generate all session IDs per user
  4. Expand the processed dataset which consists of the timestamp paired with the corresponding session IDs

Step 1:
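The original Step 1 listing isn't shown; a sketch under the assumption that df has columns user and timestamp:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val tmo1 = 15 * 60L  // 15 minutes, in seconds

val winUser = Window.partitionBy($"user").orderBy($"timestamp")

// Gap (seconds) since the previous row; 0 marks the start of a session per rule #1
val df1 = df.
  withColumn("ts_diff",
    unix_timestamp($"timestamp") - unix_timestamp(lag($"timestamp", 1).over(winUser))).
  withColumn("ts_diff",
    when($"ts_diff".isNull || $"ts_diff" > tmo1, 0L).otherwise($"ts_diff"))
```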

Steps 2-4:
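The original Steps 2-4 listing isn't shown either; a sketch, applying the sessionization UDF described above (called sessionizeUDF here as a hypothetical name) to the grouped lists and exploding the result:

```scala
import org.apache.spark.sql.functions._

val tmo2 = 60 * 60L  // 60 minutes, in seconds

val df2 = df1.  // df1 from Step 1
  groupBy($"user").
  agg(collect_list($"timestamp").as("ts_list"), collect_list($"ts_diff").as("ts_diffs")).
  withColumn("ts_sess", explode(sessionizeUDF(tmo2)($"ts_list", $"ts_diffs", $"user"))).
  select($"user", $"ts_sess._1".as("timestamp"), $"ts_sess._2".as("session_id"))
```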

2 thoughts on "Spark – Custom Timeout Sessions"

  1. Anjali Jain, July 24, 2020 at 4:15 am

     is it possible to do this in pyspark?

     1. Leo Cheung (post author), July 24, 2020 at 8:13 pm

        Thanks for the comment. I'm no Python guru, but am pretty sure it's possible to do something similar in PySpark. The foldLeft Scala method in the UDF is probably the only code portion that needs an implementation in some different way, like using some conditional iteration logic in an imperative fashion.


Fibonacci In Scala: Tailrec, Memoized

One of the most popular number series being used as a programming exercise is undoubtedly the Fibonacci numbers:
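That is, the sequence defined by the recurrence (with the common seeding):

F(0) = 0, F(1) = 1, F(n) = F(n-1) + F(n-2) for n ≥ 2

yielding 0, 1, 1, 2, 3, 5, 8, 13, 21, …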

Perhaps a prominent reason why the Fibonacci sequence is of vast interest in Math is the associated Golden Ratio, but I think what makes it a great programming exercise is that despite a simplistic definition, the sequence's exponential growth rate presents challenges in implementations with space/time efficiency in mind.

Having seen various ways of implementing methods for the Fibonacci numbers, I thought it might be worth putting them together, from a naive implementation to something more space/time efficient. But first, let's take a quick look at the computational complexity of Fibonacci.

Fibonacci complexity

If we denote T(n) as the time required to compute F(n), by definition:

T(n) = T(n-1) + T(n-2) + K

where K is the time taken by some simple arithmetic to arrive at F(n) from F(n-1) and F(n-2).

With some approximation Math analysis (see this post), it can be shown that the lower bound and upper bound of T(n) are O(2^(n/2)) and O(2^n), respectively. For better precision, one can derive a more exact time complexity by solving the associated characteristic equation, x^2 = x + 1, which yields x = ~1.618, to deduce that:

T(n) = O(R^n)

where R = ~1.618 is the Golden Ratio.

As for space complexity, if one looks at the recursive tree for computing F(n), it's pretty clear that its depth is F(n-1)'s tree depth plus one. Thus, the required space for F(n) is proportional to n. In other words:

S(n) = O(n)

The relatively small space complexity compared with the exponential time complexity explains why computing a Fibonacci number too large for a computer would generally lead to a seemingly infinite run rather than an out-of-memory/stack overflow problem.

It's worth noting, though, if F(n) is computed via conventional iterations (e.g. a while-loop or tail recursion, which gets translated into iterations by Scala under the hood), the time complexity would be reduced to O(n), proportional to the number of the loop cycles. And the space complexity would be O(1), since no n-dependent extra space is needed other than that for storing the Fibonacci sequence.

Naive Fibonacci

To generate Fibonacci numbers, the most straightforward approach is via a basic recursive function like below:
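A minimal version of such a function:

```scala
// Exponential time: fib(n-1) and fib(n-2) recompute overlapping subtrees
def fib(n: Int): BigInt =
  if (n < 2) n else fib(n - 1) + fib(n - 2)
```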

With such a naive recursive function, computing the 50th number, i.e. fib(50), would take minutes on a typical laptop, and attempts to compute any number higher up, like fib(90), would most certainly lead to a seemingly infinite run.

Tail recursive Fibonacci

So, let’s come up with a tail recursive method:
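A sketch of one such tail recursive version (the names fibTR and fibFcn follow the post’s later references; the exact accumulator layout is an assumption):

```scala
import scala.annotation.tailrec

// Tail recursive Fibonacci: the two numbers preceding the current one
// are carried over in accumulators, so each step reuses the stack frame
def fibTR(n: Int): BigInt = {
  @tailrec
  def fibFcn(i: Int, prev: BigInt, curr: BigInt): BigInt =
    if (i == 0) prev
    else fibFcn(i - 1, curr, prev + curr)

  fibFcn(n, 0, 1)
}
```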

As shown above, tail recursion is accomplished by means of a couple of accumulators, passed as parameters to the inner method, that recursively carry over the two numbers preceding the current one.

With the Fibonacci TailRec version, computing, say, the 90th number would finish instantaneously.

Fibonacci in a Scala Stream

Another way of implementing Fibonacci is to define the sequence to be stored in a “lazy” collection, such as a Scala Stream:
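A sketch consistent with the scan-based definition discussed next (on Scala 2.13+, LazyList is the non-deprecated equivalent of Stream):

```scala
// Fibonacci as a lazily evaluated, memoizing Scala Stream
lazy val fibStream: Stream[BigInt] = 1 #:: fibStream.scan(BigInt(1))(_ + _)
```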

Using method scan, scan(1)(_ + _) generates a Stream in which each element is successively assigned the sum of the two preceding elements. Since Streams are “lazy”, none of the element values in the defined fibStream is evaluated until the element is requested.

While at it, here are a couple of other commonly seen Fibonacci implementation variants with Scala Stream:
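Two commonly seen variants, sketched here for illustration:

```scala
// Variant 1: zip the stream with its own tail to sum adjacent pairs
lazy val fibZip: Stream[BigInt] =
  BigInt(0) #:: BigInt(1) #:: fibZip.zip(fibZip.tail).map { case (a, b) => a + b }

// Variant 2: a recursive generator carrying the two current numbers
def fibFrom(a: BigInt, b: BigInt): Stream[BigInt] = a #:: fibFrom(b, a + b)
```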

Scala Stream memoizes by design

These Stream-based Fibonacci implementations perform reasonably well, somewhat comparable to the tail recursive Fibonacci. But while these Stream implementations all involve recursion, none of them is tail recursive. So, why don’t they suffer the same performance issue as the naive Fibonacci implementation does? The short answer is memoization.

Digging into the source code of Scala Stream would reveal that method #:: (which is wrapped in class ConsWrapper) is defined as:

Tracing method cons further reveals that the Stream tail is a by-name parameter to class Cons, thus ensuring that the concatenation is performed lazily:

But lazy evaluation via a by-name parameter does nothing for memoization. Digging deeper into the source code, one would see that Stream content is iterated through a StreamIterator class defined as follows:

The inner class LazyCell not only takes a by-name parameter but, more importantly, holds the Stream represented by the StreamIterator instance in a lazy val which, by nature, enables memoization by caching the value upon the first (and only the first) evaluation.
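As a self-contained illustration of that mechanism (a simplified stand-in, not the actual library source), a by-name parameter captured in a lazy val is evaluated exactly once:

```scala
// Simplified LazyCell: the by-name thunk is evaluated on first access
// of v, then cached; subsequent accesses reuse the cached value
class LazyCell[A](st: => A) {
  lazy val v: A = st
}

var evaluations = 0
val cell = new LazyCell({ evaluations += 1; 42 })
val first  = cell.v
val second = cell.v
```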

Memoized Fibonacci using a mutable Map

While using a Scala Stream to implement Fibonacci automatically leverages memoization, one could also employ the very feature explicitly, without Streams. For instance, by leveraging method getOrElseUpdate of a mutable Map, a memoize function can be defined as follows:
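A sketch of such a memoize function:

```scala
import scala.collection.mutable

// getOrElseUpdate computes and caches f(k) on the first lookup of k,
// and returns the cached value on every subsequent lookup
def memoize[K, V](f: K => V): K => V = {
  val cache = mutable.Map.empty[K, V]
  k => cache.getOrElseUpdate(k, f(k))
}
```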

For example, the naive Fibonacci equipped with memoization via this memoize function instantly becomes a much more efficient implementation:
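A sketch of the memoized version (the cache is inlined here, rather than going through the generic memoize function, to keep the snippet self-contained):

```scala
import scala.collection.mutable

// Each F(i) is computed only once and then served from the cache,
// collapsing the exponential call tree to a linear one
val fibCache = mutable.Map.empty[Int, BigInt]
def fibM(n: Int): BigInt =
  fibCache.getOrElseUpdate(n, if (n < 2) BigInt(n) else fibM(n - 1) + fibM(n - 2))
```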

For the tail recursive Fibonacci fibTR, this memoize function wouldn’t be applicable, as its inner function fibFcn takes accumulators as additional parameters. As for the Stream-based fibS, which is already equipped with Stream’s memoization, applying memoize wouldn’t produce any significant performance gain.

2 thoughts on “Fibonacci In Scala: Tailrec, Memoized”

  1. Pingback: Scala Unfold | Genuine Blog

  2. Pingback: Trampolining with Scala TailCalls - Genuine Blog


Akka Dynamic Pub-Sub Service

As shown in a previous blog post illustrating a simple text mining application, Akka Stream provides a robust GraphDSL domain-specific language for assembling stream graphs using a mix of fan-in/fan-out junctions. While GraphDSL is a great tool, the supported fan-in/fan-out components are limited to a fixed number of inputs and outputs, as all connections of the graph must be known and wired upfront.

To build a streaming service that allows new producers and consumers to be dynamically added, one would need to look outside of GraphDSL. In this blog post, we’re going to look at how to build a dynamic publish-subscribe service.

MergeHub

Akka provides MergeHub, which serves as a dynamic fan-in junction in the form of a Source to be attached to by a single consumer. Once materialized, multiple producers can be attached to the hub, and elements coming from those producers are emitted in a first-come-first-served fashion, with backpressure support.

MergeHub.source has the following method signature:
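From the Akka Stream API, roughly:

```scala
def source[T](perProducerBufferSize: Int): Source[T, Sink[T, NotUsed]]
```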

Example:
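A hedged sketch (requires the Akka Stream library and an implicit ActorSystem/Materializer in scope; names are illustrative):

```scala
// Attaching the single consumer materializes a Sink to which
// any number of producers can then be dynamically attached
val toConsumer: Sink[String, NotUsed] =
  MergeHub.source[String](perProducerBufferSize = 16)
    .to(Sink.foreach(println))
    .run()

// Two dynamically attached producers
Source(List("a", "b", "c")).runWith(toConsumer)
Source(List("x", "y", "z")).runWith(toConsumer)
```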

BroadcastHub

BroadcastHub, on the other hand, serves as a dynamic fan-out junction in the form of a Sink to be attached to by a single producer. Similarly, once materialized, multiple consumers can be attached to it, again with backpressure support.

BroadcastHub.sink has the following method signature:
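From the Akka Stream API, roughly:

```scala
def sink[T](bufferSize: Int): Sink[T, Source[T, NotUsed]]
```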

Example:
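A hedged sketch (same Akka setup assumed):

```scala
// Attaching the single producer materializes a Source from which
// any number of consumers can then be dynamically attached
val fromProducer: Source[String, NotUsed] =
  Source(List("a", "b", "c"))
    .toMat(BroadcastHub.sink(bufferSize = 256))(Keep.right)
    .run()

// Each dynamically attached consumer receives every element
fromProducer.runForeach(s => println(s"consumer1: $s"))
fromProducer.runForeach(s => println(s"consumer2: $s"))
```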

It should be cautioned that if the source has fewer elements than the bufferSize specified in BroadcastHub.sink, none of the elements will be consumed by the attached consumers. It took me a while to realize that it’s fromProducer that “silently” consumes the elements when materialized, before the attached consumers have a chance to consume them. That, to me, is really an undocumented “bug”. Using alsoToMat as shown below, one can uncover the seemingly “missing” elements in such a case:
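A hedged sketch of that alsoToMat workaround (same Akka setup assumed; the extra foreach sink makes the otherwise “missing” elements visible):

```scala
val fromProducer: Source[String, NotUsed] =
  Source(List("a", "b", "c"))
    .alsoToMat(BroadcastHub.sink(bufferSize = 256))(Keep.right)
    .toMat(Sink.foreach(s => println(s"source: $s")))(Keep.left)
    .run()
```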

MergeHub + BroadcastHub

By connecting a MergeHub with a BroadcastHub, one can create a dynamic publish-subscribe “channel” in the form of a Flow via Flow.fromSinkAndSource:
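A sketch (Akka Stream, implicit ActorSystem assumed; buffer sizes are illustrative):

```scala
// Materialize both hubs at once: psSink is the dynamic fan-in end,
// psSource the dynamic fan-out end
val (psSink, psSource) =
  MergeHub.source[String](perProducerBufferSize = 16)
    .toMat(BroadcastHub.sink[String](bufferSize = 256))(Keep.both)
    .run()

// A pub-sub channel: whatever flows in is broadcast to all subscribers
val psChannel: Flow[String, String, NotUsed] =
  Flow.fromSinkAndSource(psSink, psSource)
```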

Note that Keep.both in the above snippet produces a Tuple of materialized values (Sink[T, NotUsed], Source[T, NotUsed]) from MergeHub.source[T] and BroadcastHub.sink[T]. The pub-sub channel psChannel can be illustrated as follows:

Below is sample code for a simple pub-sub channel psChannel:

Serving as a pub-sub channel, the input of psChannel is published via psSink to all subscribers, while its output streams through psSource all the published elements. For example:

Running psChannel as a Flow:

Note that each of the input elements for psChannel gets consumed by every consumer.

Other relevant topics that might be of interest include KillSwitch for stream completion control and PartitionHub for routing stream elements from a given producer to a dynamic set of consumers.


Custom Akka Stream Processing

The Akka Stream API comes with a suite of versatile tools for stream processing. Besides the Graph DSL, a set of built-in stream operators is also readily available. Yet, if more custom streams are needed, GraphStage allows one to create streaming operators with specific stream processing logic between the input and output ports.

As illustrated in the Akka Stream documentation on custom stream processing, one can come up with a transformation function like map or filter with a custom GraphStage in just a few lines of code. For example, method map can be implemented as a Flow using GraphStage:
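Adapted from the custom stream processing page of the Akka Stream documentation:

```scala
class Map[A, B](f: A => B) extends GraphStage[FlowShape[A, B]] {
  val in = Inlet[A]("Map.in")
  val out = Outlet[B]("Map.out")
  override val shape: FlowShape[A, B] = FlowShape.of(in, out)

  override def createLogic(attr: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      setHandler(in, new InHandler {
        // A pushed element is grabbed, transformed and pushed downstream
        override def onPush(): Unit = push(out, f(grab(in)))
      })
      setHandler(out, new OutHandler {
        // Downstream demand triggers a pull on the input port
        override def onPull(): Unit = pull(in)
      })
    }
}
```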

In analyzing a given stream operation, it’s easier to reason about the flow logic starting from the downstream and tracing upward. With that in mind, let’s look at the above snippet. When there is demand from the downstream to pull an element out of the output port, callback method onPull is called, which initiates a pull of a new element into the input port; upon a push from the upstream, the onPush callback is triggered to grab the element on the input port, apply function f, and push the result to the output port.

What is a GraphStage?

A GraphStage represents a stream processing operator. Below is the source code of abstract classes GraphStage and GraphStageLogic:

To use (i.e. extend) a GraphStage, one needs to implement method createLogic, which returns a GraphStageLogic that takes a shape and sets up handlers via method setHandler, which, in turn, takes an Inlet/Outlet and an InHandler/OutHandler as arguments. These InHandler and OutHandler routines are where the custom processing logic for every stream element resides.

As illustrated in the map or filter implementation in the mentioned Akka doc, to define a GraphStage one minimally needs to define the in, out and shape (FlowShape in those examples) of the graph, as well as the stream processing logic in the InHandler and OutHandler.

Handling external asynchronous events

Among various customizing features, one can extend a GraphStage to handle asynchronous events (i.e. Scala Futures) that aren’t part of the stream. To do that, simply define a callback using getAsyncCallback to create an AsyncCallback, which will be invoked by the external event via method invoke.

As an exercise in building custom stream processing operators with GraphStage, we’re going to modify the above map Flow into one that dynamically changes the transformation function upon being triggered by an asynchronous event. Let’s name the class DynamicMap; it takes a switch event of type Future[Unit] and two ‘A => B’ transformation functions (f being the original function and g the switched one).
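A hedged sketch of what DynamicMap might look like (Akka Stream required; details are assumptions based on the surrounding description):

```scala
class DynamicMap[A, B](switch: Future[Unit], f: A => B, g: A => B)
                      (implicit ec: ExecutionContext)
    extends GraphStage[FlowShape[A, B]] {
  val in = Inlet[A]("DynamicMap.in")
  val out = Outlet[B]("DynamicMap.out")
  override val shape: FlowShape[A, B] = FlowShape.of(in, out)

  override def createLogic(attr: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      private var flipped = false

      // Register the async callback in preStart to avoid race conditions
      override def preStart(): Unit = {
        val callback = getAsyncCallback[Unit](_ => flipped = true)
        switch.foreach(callback.invoke)
      }

      setHandler(in, new InHandler {
        override def onPush(): Unit =
          push(out, if (flipped) g(grab(in)) else f(grab(in)))
      })
      setHandler(out, new OutHandler {
        override def onPull(): Unit = pull(in)
      })
    }
}
```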

In this case, callback simply flips variable flipped from the initial false value to true, so that when an element is pushed on the input port, the InHandler will now push g(elem) rather than f(elem) to the output port. In addition, an ExecutionContext, required by method invoke for the callback invocation, is passed in as an implicit parameter.

Note that to avoid race conditions, the callback is defined and invoked using the preStart lifecycle hook, rather than in the constructor of GraphStageLogic.

Testing DynamicMap with a dummy asynchronous event switch that simply returns in a few milliseconds, and a couple of trivial ‘DataIn => DataOut’ transformation functions:


Spark – Schema With Nested Columns

Extracting columns based on certain criteria from a DataFrame (or Dataset) with a flat schema of only top-level columns is simple. It gets slightly less trivial, though, if the schema consists of hierarchically nested columns.

Recursive traversal

In functional programming, a common tactic for traversing arbitrarily nested collections of elements is recursion, which is generally preferred over while-loops with mutable counters. For performance at scale, making the traversal tail-recursive may be necessary, although it’s less of a concern in this case, given that a DataFrame typically consists of no more than a few hundred columns and a few levels of nesting.

We’re going to illustrate in a couple of simple examples how recursion can be used to effectively process a DataFrame with a schema of nested columns.

Example #1: Get all nested columns of a given data type

Consider the following snippet:
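A sketch of such a method against Spark’s org.apache.spark.sql.types API (the names columnsOfType and traverse are assumptions, not from the original post):

```scala
import org.apache.spark.sql.types._

// Collect fully qualified names (e.g. "a.b.c") of all columns,
// nested or not, whose data type matches the given one
def columnsOfType(schema: StructType, dtype: DataType): Seq[String] = {
  def traverse(st: StructType, prefix: String): Seq[String] =
    st.fields.toSeq.flatMap { field =>
      field.dataType match {
        case nested: StructType => traverse(nested, s"$prefix${field.name}.")
        case dt if dt == dtype  => Seq(s"$prefix${field.name}")
        case _                  => Seq.empty
      }
    }
  traverse(schema, "")
}
```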

By means of a simple recursive method, the data type of each column in the DataFrame is traversed and, in the case of a StructType, recursion continues into its child columns. A string prefix assembled during the traversal expresses the hierarchy of the individual nested columns and gets prepended to the columns with the matching data type.

Testing the method:

Example #2: Rename all nested columns via a provided function

In this example, we’re going to rename the columns in a DataFrame with a nested schema, based on a provided rename function. The logic for recursively traversing the nested columns is pretty much the same as in the previous example.
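A sketch (the method name renameAllColumns is an assumption):

```scala
import org.apache.spark.sql.types._

// Rebuild the schema with every field, at every nesting level,
// renamed via the provided function; map preserves the hierarchy
def renameAllColumns(schema: StructType, rename: String => String): StructType =
  StructType(schema.fields.map { field =>
    field.dataType match {
      case nested: StructType =>
        field.copy(name = rename(field.name), dataType = renameAllColumns(nested, rename))
      case _ =>
        field.copy(name = rename(field.name))
    }
  })
```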

Testing the method (with the same DataFrame used in the previous example):

In case it isn’t obvious: in traversing a given StructType’s child columns, we use map (as opposed to flatMap in the previous example) to preserve the hierarchical column structure.


Scala’s groupMap And groupMapReduce

For grouping elements in a Scala collection by a provided key, the de facto method of choice has been groupBy, which has the following signature for an Iterable:
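Roughly (as in the Scala 2.12 API):

```scala
def groupBy[K](f: A => K): immutable.Map[K, Iterable[A]]
```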

It returns an immutable Map of entries, each consisting of a key and a collection of values of the original type. To process these value collections in the resulting Map, Scala provides a method mapValues with the below signature:
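Roughly (as in the Scala 2.12 Map API):

```scala
def mapValues[W](f: V => W): Map[K, W]
```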

This groupBy/mapValues combo proves handy for processing the values of the Map generated from the grouping. However, as of Scala 2.13, method mapValues on a Map is deprecated (relegated to the lazy MapView).

groupMap

A new method, groupMap, has emerged for grouping a collection based on provided functions that define the keys and values of the resulting Map. Here’s the signature of method groupMap for an Iterable:
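Roughly (as in the Scala 2.13 API):

```scala
def groupMap[K, B](key: A => K)(f: A => B): immutable.Map[K, Iterable[B]]
```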

Let’s start with a simple example grouping via the good old groupBy method:
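A hypothetical illustration (the words list is assumed, not from the original post):

```scala
val words = List("apple", "banana", "avocado", "cherry", "blueberry")

// Group the words by their first letter
val byInitial = words.groupBy(_.head)
```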

We can replace groupBy with groupMap like below:
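Equivalently, with groupMap (same assumed words list; Scala 2.13+):

```scala
val words = List("apple", "banana", "avocado", "cherry", "blueberry")

// identity as the value function reproduces groupBy's behavior
val byInitial = words.groupMap(_.head)(identity)
```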

In this particular case, the new method doesn’t offer any benefit over the old one.

Let’s look at another example that involves a collection of class objects:
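A hypothetical pet collection (the class and sample data are assumptions):

```scala
case class Pet(species: String, name: String)

val pets = List(
  Pet("dog", "Rex"),
  Pet("cat", "Whiskers"),
  Pet("dog", "Fido"),
  Pet("bird", "Tweety")
)
```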

If we want to list all pet names per species, a groupBy coupled with mapValues will do:
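A sketch (defs repeated so the snippet stands alone; on Scala 2.13, mapValues emits a deprecation warning and needs a trailing toMap):

```scala
case class Pet(species: String, name: String)
val pets = List(Pet("dog", "Rex"), Pet("cat", "Whiskers"), Pet("dog", "Fido"), Pet("bird", "Tweety"))

// Group first, then re-transform the grouped values
val namesBySpecies = pets.groupBy(_.species).mapValues(_.map(_.name)).toMap
```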

But in this case, groupMap offers better readability, thanks to the functions defining the keys and values of the resulting Map being nicely placed side by side as parameters:
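The groupMap equivalent (defs repeated for self-containment; Scala 2.13+):

```scala
case class Pet(species: String, name: String)
val pets = List(Pet("dog", "Rex"), Pet("cat", "Whiskers"), Pet("dog", "Fido"), Pet("bird", "Tweety"))

// Key function and value function side by side
val namesBySpecies = pets.groupMap(_.species)(_.name)
```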

groupMapReduce

At times, we need to perform a reduction on the Map values after grouping a collection. This is when the other new method, groupMapReduce, comes in handy:
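Roughly (as in the Scala 2.13 API):

```scala
def groupMapReduce[K, B](key: A => K)(f: A => B)(reduce: (B, B) => B): immutable.Map[K, B]
```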

Besides the parameters defining the keys and values of the resulting Map, as in groupMap, groupMapReduce also expects an additional parameter in the form of a binary operation for the reduction.

Using the same pets example, if we want to compute the count of pets per species, a groupBy/mapValues approach will look like below:
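A sketch (defs repeated for self-containment):

```scala
case class Pet(species: String, name: String)
val pets = List(Pet("dog", "Rex"), Pet("cat", "Whiskers"), Pet("dog", "Fido"), Pet("bird", "Tweety"))

val countBySpecies = pets.groupBy(_.species).mapValues(_.size).toMap
```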

With groupMapReduce, we can “compartmentalize” the functions for the keys, the values and the reduction operation separately, as follows:
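The groupMapReduce version (defs repeated for self-containment; Scala 2.13+):

```scala
case class Pet(species: String, name: String)
val pets = List(Pet("dog", "Rex"), Pet("cat", "Whiskers"), Pet("dog", "Fido"), Pet("bird", "Tweety"))

// Map each pet to 1, then sum the 1s within each species group
val countBySpecies = pets.groupMapReduce(_.species)(_ => 1)(_ + _)
```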

One more example:

Let’s say we want to compute the monthly total of list price and discounted price of the product list. In the groupBy/mapValues way:
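A hypothetical product list and the groupBy/mapValues way (the class and sample data are assumptions):

```scala
case class Item(month: String, listPrice: Double, salePrice: Double)
val items = List(
  Item("Jan", 10.0, 8.0),
  Item("Jan", 20.0, 15.0),
  Item("Feb", 30.0, 25.0)
)

// Per month: (total list price, total discounted price)
val monthlyTotals = items.groupBy(_.month).mapValues(
  is => (is.map(_.listPrice).sum, is.map(_.salePrice).sum)
).toMap
```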

Using groupMapReduce:
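The same computation with groupMapReduce (defs repeated for self-containment; Scala 2.13+):

```scala
case class Item(month: String, listPrice: Double, salePrice: Double)
val items = List(Item("Jan", 10.0, 8.0), Item("Jan", 20.0, 15.0), Item("Feb", 30.0, 25.0))

// Map each item to a (listPrice, salePrice) pair, then sum pairwise
val monthlyTotals = items.groupMapReduce(_.month)(i => (i.listPrice, i.salePrice)) {
  case ((l1, s1), (l2, s2)) => (l1 + l2, s1 + s2)
}
```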

2 thoughts on “Scala’s groupMap And groupMapReduce”

  1. Jim Newton, December 26, 2021 at 2:29 am

    It is interesting when some bizarre piece of code turns out to be a well established pattern. I was able to refactor a piece of code which was creating a huge amount of GC pressure into a single call to groupMapReduce. The code is much shorter, and the GC pressure was eliminated.

    1. Leo Cheung (post author), December 26, 2021 at 9:54 am

      Thanks for the comment Jim. This discussion thread about issues re: Scala groupBy (i.e. the “early” return of a Map and common use case of having to re-transform the returned Map with groupBy/mapValues) may or may not be exactly the performance problem that prompted your refactoring work. Nonetheless, it appears to have motivated the creation of methods groupMap and groupMapReduce.



Composing Partial Functions In Scala

Just like partial functions in mathematics, a partial function in Scala is a function whose domain doesn’t cover all elements of the domain’s data type. For example:
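A hypothetical example (the particular function 100 / n is an assumption; any non-total function would do):

```scala
// Integer division: undefined (throws) at n == 0
val f = (n: Int) => 100 / n
```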

It’s a function defined for all non-zero integers, but f(0) would produce a java.lang.ArithmeticException.

By defining it as a partial function like below:
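A sketch of the partial-function form:

```scala
// The guard excludes 0 from the function's domain
val pf: PartialFunction[Int, Int] = { case n if n != 0 => 100 / n }
```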

we will be able to leverage PartialFunction’s methods like isDefinedAt to check on a given element before applying the function to it.

Methods lift and unlift

Scala provides a method lift for “lifting” a partial function into a total function that returns an Option type. Using the above partial function as an example:
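A sketch (pf repeated for self-containment):

```scala
val pf: PartialFunction[Int, Int] = { case n if n != 0 => 100 / n }

// lift turns the partial function into a total Int => Option[Int]
val lifted: Int => Option[Int] = pf.lift
```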

Simple enough. Conversely, an Option-typed total function can be “unlifted” into a partial function. Applying unlift to the above lifted function would create a new partial function, the same as pf:
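A sketch (the Option-typed function is written out explicitly here for self-containment):

```scala
val g: Int => Option[Int] = n => if (n != 0) Some(100 / n) else None

// unlift recovers a partial function whose domain excludes the None cases
val pf2: PartialFunction[Int, Int] = Function.unlift(g)
```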

Function compositions

For simplicity, we’ll look only at functions of arity 1 (i.e. Function1, which takes a single argument). Extending the same concept to FunctionN is trivial.

Methods like andThen and compose enable compositions of Scala functions. Since the two methods are quite similar, I’m going to talk about andThen only. Readers who would like to extend the idea to compose may try it as a programming exercise.

Method andThen for Function1[T1, R] has the following signature:
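From the standard library:

```scala
def andThen[A](g: R => A): T1 => A
```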

A trivial example:
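A sketch (function names are assumptions):

```scala
val double = (n: Int) => 2 * n
val add1 = (n: Int) => n + 1

// Apply double first, then add1
val doubleThenAdd1 = double andThen add1
```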

Now, let’s replace the 2nd function add1 with a partial function inverse:
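A sketch:

```scala
val double = (d: Double) => 2 * d
val inverse: PartialFunction[Double, Double] = { case d if d != 0.0 => 1 / d }

// Composes fine, but the result is a plain total function
val doubleThenInverse: Double => Double = double andThen inverse
```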

Note that doubleThenInverse still returns a total function even though the composing function is partial. That’s because PartialFunction is a subclass of Function1:
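From the standard library:

```scala
trait PartialFunction[-A, +B] extends (A => B)
```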

hence method andThen rightfully returns a total function, as advertised.

Unfortunately, that’s undesirable, as the resulting function loses the inverse partial function’s domain information.

Partial function compositions

Method andThen for a PartialFunction, returning a PartialFunction[A, C], has the following signature:
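From the Scala 2.13 standard library, roughly:

```scala
def andThen[C](k: PartialFunction[B, C]): PartialFunction[A, C]
```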

Example:
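A sketch consistent with the discussion that follows (pfMap, undefined at 0.5, is an assumption; the composite-domain behavior requires Scala 2.13+):

```scala
val inverse: PartialFunction[Double, Double] = { case d if d != 0.0 => 1 / d }
val pfMap: PartialFunction[Double, String] = { case d if d != 0.5 => s"<$d>" }

// On 2.13+, the composed domain also accounts for pfMap's domain
val inverseThenPfMap: PartialFunction[Double, String] = inverse andThen pfMap
```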

That works perfectly, since any given element not in the domain of one of the partial functions being composed should have its corresponding element(s) eliminated from the domain of the composed function. In this case, 0.5 is not in the domain of pfMap; hence its corresponding element, 2 (which would have been inverse-ed to 0.5), should not be in inverseThenPfMap’s domain.

Unfortunately, I neglected to mention that I’m on Scala 2.13.x. On Scala 2.12 or below, inverseThenPfMap.isDefinedAt(2) would be true.

Turning composed functions into a proper partial function

Summarizing what we’ve looked at, there are two issues at hand:

  1. If the first function among the functions being composed is a total function, the composed function is a total function, discarding the domain information of any subsequent partial functions being composed.
  2. Unless you’re on Scala 2.13+, even with the first function being a partial function, the resulting composed function is a partial function whose domain does not embody the domain information of any subsequent partial functions being composed.

To tackle the issues, one approach is to leverage implicit conversion by defining a couple of implicit methods that handle composing a partial function on a total function and on a partial function, respectively.
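A sketch consistent with the description that follows (the method name andThenPF matches the post’s later references; the wrapper names are assumptions):

```scala
object PFComposition {
  // Compose a partial function onto a total function: f is made
  // Option-typed, flatMapped with the lifted pf, then unlifted
  implicit class TotalAndThenPF[A, B](f: A => B) {
    def andThenPF[C](pf: PartialFunction[B, C]): PartialFunction[A, C] =
      Function.unlift((a: A) => Some(f(a)).flatMap(pf.lift))
  }
  // Compose a partial function onto another partial function:
  // both are lifted, chained with flatMap, then unlifted
  implicit class PartialAndThenPF[A, B](pf: PartialFunction[A, B]) {
    def andThenPF[C](pf2: PartialFunction[B, C]): PartialFunction[A, C] =
      Function.unlift((a: A) => pf.lift(a).flatMap(pf2.lift))
  }
}

import PFComposition._

val double = (d: Double) => 2 * d
val inverse: PartialFunction[Double, Double] = { case d if d != 0.0 => 1 / d }

// Unlike plain andThen, the result is a proper partial function
val doubleThenInversePF = double andThenPF inverse
```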

Note that the implicit methods are defined as methods within implicit class wrappers, a common practice that lets the implicit conversion be carried out by invoking the methods as if they were class methods.

In the first implicit class, function f (i.e. the total function to be implicitly converted) is first transformed to return an Option, chained using flatMap to the lifted partial function (i.e. the partial function to be composed), followed by an unlift to return a partial function.

Similarly, in the second implicit class, function pf (i.e. the partial function to be implicitly converted) is first lifted, chained to the lifted partial function (i.e. the partial function to be composed), followed by an unlift.

In both cases, andThenPF returns a partial function that incorporates the partial domains of the functions involved in the function composition.

Let’s reuse the double and inverse functions from a previous example:

Recall from that example that doubleThenInverse is a total function. Now, let’s replace andThen with our custom andThenPF:

The resulting function is now a partial function with the composing function’s partial domain incorporated into its own domain. I’ll leave testing the cases in which the function being composed is a partial function to the readers.

1 thought on “Composing Partial Functions In Scala”

  1. Erik Bruchez, November 13, 2020 at 2:39 pm

    Thanks for the post: I found what I was looking for regarding composing partial functions. I didn’t know that Scala 2.13 supported that correctly but not Scala 2.12!



Ad-hoc Polymorphism In Scala

Over the past few years, there has been a subtle trend of software engineers favoring typeclass patterns that implement polymorphism in an ad-hoc fashion, namely Ad-hoc Polymorphism. To see the benefits of this kind of polymorphism, let’s first look at what F-bounded polymorphism, a subtype polymorphism, has to offer.

F-bounded polymorphism

Next, a couple of child classes are defined:
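A minimal sketch of both the F-bounded base trait and a couple of child classes (the names and fields are assumptions consistent with the surrounding discussion):

```scala
// F-bounded base trait; the self-type guards against mixed-up subclasses
trait Car[T <: Car[T]] { self: T =>
  def price: Double
  def setPrice(price: Double): T
}

case class Sedan(price: Double) extends Car[Sedan] {
  def setPrice(p: Double): Sedan = copy(price = p)
}

case class SUV(price: Double) extends Car[SUV] {
  def setPrice(p: Double): SUV = copy(price = p)
}
```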

An F-bounded type has the peculiar self-recursive signature A[T <: A[T]], which mandates that the given type T itself be a subtype of A[T], like how type Sedan is defined (Sedan <: Car[Sedan]). Note that the self-type annotation used in the trait isn’t a requirement of F-bounded types. Rather, it’s a common practice for safeguarding against an undesirable mix-up of subclasses like below:

“Type argument” versus “Type member”

Rather than as a type argument, an F-bounded type could also be expressed as a type member, which needs to be defined in its child classes:
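A sketch of the type-member formulation (names are assumptions):

```scala
// The F-bound is expressed as an abstract type member instead
trait Car {
  type T <: Car
  def setPrice(price: Double): T
}

case class Sedan(price: Double) extends Car {
  type T = Sedan
  def setPrice(p: Double): Sedan = copy(price = p)
}
```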

It should be noted that with the type member approach, the self-type would not be applicable; hence the mix-up of subclasses mentioned above is possible.

Let’s define a sedan and test out method setPrice:

Under the F-bounded type’s “contract”, a method such as the following would work as intended to return the specified sub-type:
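A sketch (the trait and child class are repeated so the snippet stands alone; the method name discount is an assumption):

```scala
trait Car[T <: Car[T]] { self: T =>
  def price: Double
  def setPrice(price: Double): T
}
case class Sedan(price: Double) extends Car[Sedan] {
  def setPrice(p: Double): Sedan = copy(price = p)
}

// The F-bound guarantees the exact subtype of the input is returned
def discount[T <: Car[T]](car: T, pct: Double): T =
  car.setPrice(car.price * (1 - pct))
```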

Had the Car/Sedan hierarchy been set up with the less specific bound T <: Car, the corresponding method:

would fail, as it couldn’t guarantee that the returned type is the exact type of the input.

F-bounded type collection

Next, let’s look at a collection of cars.

The resulting inferred type is a rather ugly sequence of gibberish. To help the compiler a little, give it some hints about T <: Car[T], as shown below:

Ad-hoc polymorphism

Contrary to subtype polymorphism, which orients around a supertype with a rigid subtype structure, let’s explore a different approach using typeclasses, known as Ad-hoc polymorphism.

Next, a couple of “ad-hoc” implicit objects are created to implement the trait methods.

Note that alternatively, the implicit objects could be set up as ordinary companion objects of the case classes with implicit anonymous classes:

Unifying implemented methods

Finally, an implicit conversion for cars of type T is provided by means of an implicit class to create a “unified” method that takes the corresponding method implementations from the provided implicit Car[T] parameter.
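Pulling the pieces described above together, a minimal sketch (the enclosing object name AdHoc and all member names are assumptions):

```scala
object AdHoc {
  // The typeclass: no subtype relationship imposed on the case classes
  trait Car[T] {
    def setPrice(car: T, price: Double): T
  }

  case class Sedan(price: Double)
  case class SUV(price: Double)

  // "Ad-hoc" implicit instances implementing the typeclass per type
  implicit object SedanCar extends Car[Sedan] {
    def setPrice(car: Sedan, price: Double): Sedan = car.copy(price = price)
  }
  implicit object SUVCar extends Car[SUV] {
    def setPrice(car: SUV, price: Double): SUV = car.copy(price = price)
  }

  // Implicit conversion providing the "unified" method for any T
  // that has a Car[T] instance in scope
  implicit class CarOps[T](car: T)(implicit ev: Car[T]) {
    def setPrice(price: Double): T = ev.setPrice(car, price)
  }
}

import AdHoc._

val cheaperSedan = Sedan(100.0).setPrice(50.0)
val cheaperSUV = SUV(100.0).setPrice(50.0)
```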

Testing it out:

New methods, like setSalePrice, can be added as needed in the implicit objects:

Ad-hoc type collection

Next, a collection of cars:

Similar to the F-bounded collection, the inferred resulting type isn’t very helpful. Unlike in the F-bounded case, we do not have a T <: Car[T] contract. Using an approach illustrated in this blog post, we could assemble the collection as a list of (car, type) tuples:

By means of a simple example, we’ve now got a sense of how Ad-hoc polymorphism works. The F-bounded example serves as a contrasting reference for how polymorphism bound by a more “strict” contract plays out in comparison. Given the flexibility of not having to bind the classes into a stringent subtype relationship upfront, the rising popularity of Ad-hoc polymorphism certainly has its merits.

That said, lots of class models in real-world applications still fit perfectly well into a subtype relationship. In suitable use cases, F-bounded polymorphism generally imposes less boilerplate code. In addition, Ad-hoc polymorphism typically involves implicits, which may impact code maintainability.

3 thoughts on “Ad-hoc Polymorphism In Scala”

  1. Pingback: Scala Type 暗黑化 | YGao的奇幻冒险

  2. Pingback: Orthogonal Typeclass In Scala | Genuine Blog


Merkle Tree Implementation In Scala

A Merkle tree, a.k.a. hash tree, is a tree in which every leaf node contains a cryptographic hash of a dataset, and every branch node contains a hash of the concatenation of the corresponding hashes of its child nodes. Typical usage is for efficient verification of the content stored in the tree nodes.

Blockchain and Merkle tree

As cryptocurrency (or more generally, blockchain system) has become popular, so has its underlying authentication-oriented data structure, Merkle tree. In the cryptocurrency world, a blockchain can be viewed as a distributed ledger consisting of immutable but chain-able blocks, each of which hosts a set of transactions in the form of a Merkle tree. In order to chain a new block to an existing blockchain, part of the tamper-proof requirement is to guarantee the integrity of the enclosed transactions by composing their hashes in a specific way and storing them in a Merkle tree.

In case the above sounds like gibberish, here’s a great introductory article about blockchain. To delve slightly deeper into it with a focus on cryptocurrency, this blockchain guide from the Bitcoin Project website might be of interest. Just to be clear, even though blockchain helps popularize the Merkle tree, implementing a flavor of the data structure does not require knowledge of blockchain or cryptocurrency.

In this blog post, we will assemble a barebone Merkle tree using Scala. While a Merkle tree is most often a binary tree, it’s certainly not confined to be one, although that’s what we’re going to implement.

A barebone Merkle tree class

Note that when both the class fields left and right are None, it represents a leaf node.
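The original class definition isn't shown here; a minimal sketch consistent with the description (a node is a leaf exactly when both children are `None`; field names assumed) could be:

```scala
// Minimal sketch of the Merkle tree node class described above.
case class MerkleTree(
  hash: Array[Byte],
  left: Option[MerkleTree] = None,
  right: Option[MerkleTree] = None
) {
  // A node is a leaf exactly when it has no children.
  def isLeaf: Boolean = left.isEmpty && right.isEmpty
}
```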

To build a Merkle tree from a collection of byte-arrays (which might represent a transaction dataset), we will use a companion object to perform the task via its apply method. To create a hash within each of the tree nodes, we will also need a hash function, hashFcn of type Array[Byte] => Array[Byte].

Building a Merkle tree

As shown in the code, what’s needed for function buildTree is to recursively pair up the nodes to form a tree, with each node consisting of the combined hash of its corresponding child nodes. The recursive pairing will eventually end with the single top-level node called the Merkle root. Below is an implementation of such a function:
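The post's own implementation isn't reproduced here; a self-contained sketch of the recursive pairing (node class re-declared so the snippet stands alone; names and signatures assumed) might look like:

```scala
import java.security.MessageDigest

// Node class re-declared here so the snippet stands alone.
case class MerkleTree(
  hash: Array[Byte],
  left: Option[MerkleTree] = None,
  right: Option[MerkleTree] = None
)

val sha256: Array[Byte] => Array[Byte] =
  bytes => MessageDigest.getInstance("SHA-256").digest(bytes)

// Recursively pair up the nodes bottom-up: each parent holds the hash of its
// children's concatenated hashes; an unpaired trailing node is re-hashed alone.
// The recursion ends at the single top node, the Merkle root.
@annotation.tailrec
def buildTree(nodes: List[MerkleTree], hashFcn: Array[Byte] => Array[Byte]): MerkleTree =
  nodes match {
    case root :: Nil => root  // the Merkle root
    case _ =>
      val paired = nodes.grouped(2).map {
        case List(l, r) => MerkleTree(hashFcn(l.hash ++ r.hash), Some(l), Some(r))
        case List(l)    => MerkleTree(hashFcn(l.hash), Some(l), None)
      }.toList
      buildTree(paired, hashFcn)
  }
```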

Now, back to class MerkleTree, and let’s add a simple function for computing height of the tree:
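The original height function isn't shown; a simple recursive sketch (node class re-declared for self-containment; the lone-leaf-has-height-1 convention is an assumption) could be:

```scala
// Node class re-declared so the snippet stands alone.
case class MerkleTree(
  hash: Array[Byte],
  left: Option[MerkleTree] = None,
  right: Option[MerkleTree] = None
)

// Height counted in levels, with a lone leaf having height 1 (convention assumed).
def height(node: MerkleTree): Int =
  1 + math.max(
    node.left.map(height).getOrElse(0),
    node.right.map(height).getOrElse(0)
  )
```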

Putting all the pieces together

For illustration purposes, we’ll add a side-effecting function printNodes along with a couple of for-display utility functions so as to see what our Merkle tree can do. Putting it all together, we have:

Test building the Merkle tree with a hash function

By providing the required arguments for MerkleTree’s apply factory method, let’s create a Merkle tree with, say, 5 dummy byte-arrays using a popular hash function SHA-256. The created Merkle tree will be represented by its tree root, a.k.a. Merkle Root:
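The original test snippet isn't shown; a hash-only rendition of the same exercise (five dummy byte-arrays hashed with SHA-256 and reduced level by level to the Merkle root; the post's factory additionally keeps the intermediate nodes) might look like:

```scala
import java.security.MessageDigest

def sha256(bytes: Array[Byte]): Array[Byte] =
  MessageDigest.getInstance("SHA-256").digest(bytes)

// Five dummy data blocks, as in the example above.
val blocks: List[Array[Byte]] = (1 to 5).toList.map(i => s"dummy-data-$i".getBytes("UTF-8"))

// Level-by-level reduction of the leaf hashes down to the Merkle root hash.
@annotation.tailrec
def merkleRootHash(hashes: List[Array[Byte]]): Array[Byte] = hashes match {
  case single :: Nil => single
  case _ =>
    val next = hashes.grouped(2).map {
      case List(l, r) => sha256(l ++ r)
      case List(l)    => sha256(l)
    }.toList
    merkleRootHash(next)
}

// Render the Merkle root as a hex string for display.
val rootHex: String = merkleRootHash(blocks.map(sha256)).map("%02x".format(_)).mkString
```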

As can be seen from the output, the 5 dummy data blocks get hashed and placed in the 5 leaf nodes, each with its hash value wrapped with its sibling’s (if any) in another hash and placed in the parent node.

For a little better clarity, below is an edited output in a tree structure:

Building a Merkle tree from blockchain transactions

To apply Merkle trees in the blockchain world, we’ll substitute the data blocks with a sequence of transactions from a blockchain.

First, we define a trivialized Transaction class with the transaction ID being the hash value of the combined class fields using the same hash function sha256:

Next, we create an array of transactions:

Again, using MerkleTree’s apply factory method, we build a Merkle tree consisting of hash values of the individual transaction IDs, which in turn are hashes of their corresponding transaction content:

The Merkle root along with the associated transactions are kept in an immutable block. It’s also an integral part of the elements to be collectively hashed into the block-identifying hash value. The block hash will serve as the linking block ID for the next block that manages to successfully append to it. All the cross-hashing operations coupled with the immutable block structure make any attempt to tamper with the blockchain content highly expensive.


Transaction Hash Tree In A Blockchain

I’m starting a mini blog series that centers around the Blockchain topic. At the end of the series will be a simple blockchain application in Scala on an Actor-based Akka cluster. The application will in some way follow a simplified version of the Bitcoin cryptocurrency’s operational model, including its proof-of-work consensus algorithm.

Cryptocurrency and Blockchain

Some quick background info about blockchain – In 2009, Bitcoin emerged as the first decentralized cryptocurrency and took the world by storm. Besides proving to the world the possibility of running a digital currency without the need of a centralized authority, it has also fascinated people (particularly in the finance and technology industries) with its simple yet effective operational model.

Cryptocurrency has also popularized the term “blockchain” which represents its underlying data structure and has since been broadened to a computing class that covers a wide range of applications (e.g. “smart contracts”) in different domains. Even though conceptually how a cryptocurrency like Bitcoin works isn’t complicated, it does require some basic knowledge in cryptography, particularly in PKCS (public key cryptography standards).

Utility functions

First, a few utility functions:

Hashing is a critical process for integrity check in blockchain’s underlying data structure. We’ll use SHA-256, which has a function signature of Array[Byte] => Array[Byte]. We’ll also need some minimal cryptographic functions for creating public keys that serve as IDs for user accounts.
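The original utility functions aren't reproduced here; a sketch with the stated `Array[Byte] => Array[Byte]` SHA-256 signature, plus a few assumed byte/display helpers, could be:

```scala
import java.security.MessageDigest
import java.util.Base64

// SHA-256 with the Array[Byte] => Array[Byte] signature described above.
val sha256: Array[Byte] => Array[Byte] =
  bytes => MessageDigest.getInstance("SHA-256").digest(bytes)

// Helpers (assumed) for moving between bytes and printable forms.
def bytesToBase64(bytes: Array[Byte]): String = Base64.getEncoder.encodeToString(bytes)
def base64ToBytes(s: String): Array[Byte]    = Base64.getDecoder.decode(s)
def bytesToHex(bytes: Array[Byte]): String   = bytes.map("%02x".format(_)).mkString
```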

Basic cryptography

To load Base64 public keys from the key files commonly in PKCS#8 PEM format on a file system, we use Bouncy Castle and Apache Commons Codec. As a side note, neither of the additional packages would be needed if the key files were in PKCS#8 DER format, which is binary and less commonly used.
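The post's Bouncy Castle / Commons Codec snippet isn't reproduced here; as a JDK-only approximation of the same idea (strip the PEM marker lines, Base64-decode the body, and rebuild the key from the DER-encoded key info; note a public-key PEM body is X.509 SubjectPublicKeyInfo, and the algorithm name is an assumption):

```scala
import java.security.spec.X509EncodedKeySpec
import java.security.{KeyFactory, PublicKey}
import java.util.Base64

// JDK-only sketch: drop the -----BEGIN/END----- marker lines, decode the
// Base64 body, and rebuild the public key from the DER bytes.
def publicKeyFromPem(pem: String, algorithm: String = "RSA"): PublicKey = {
  val base64Body = pem.split("\\R").filterNot(_.startsWith("-----")).mkString
  val der = Base64.getDecoder.decode(base64Body)
  KeyFactory.getInstance(algorithm).generatePublic(new X509EncodedKeySpec(der))
}
```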

Transactions in a Blockchain

With some basic utility and crypto functions in place, we now create class Account, which represents a user (e.g. a transaction originator or a miner) with the user’s cryptographic public key as the account ID. The corresponding private key, which the user is supposed to keep private, is for decrypting transactions encrypted with the public key. In our simplified model, the transactions won’t be encrypted, hence private keys won’t be used.

Next, we create class TransactionItem that represents a single transaction.

The id of TransactionItem is the hash value of the concatenated class fields in bytes. Note that the apply factory method performs the necessary hashing of the provided arguments to assemble a TransactionItem with the hash-value ID.

Next, we define class Transactions, representing a collection of TransactionItems. The id of Transactions is just a random UUID. It could’ve been defined as a collective hash value like TransactionItem’s id to ensure content integrity, but we’re going to leave that to be taken care of by a hash-tree data structure, the Merkle tree.
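The original class definitions aren't shown here; a compact sketch consistent with the two descriptions above (field names assumed, with plain strings standing in for account public keys) might look like:

```scala
import java.security.MessageDigest
import java.util.UUID

def sha256(bytes: Array[Byte]): Array[Byte] =
  MessageDigest.getInstance("SHA-256").digest(bytes)

// A single transaction; its id is the hash of the concatenated fields.
case class TransactionItem(id: Array[Byte], from: String, to: String, amount: Long, timestamp: Long)

object TransactionItem {
  // Factory method performing the hashing of the provided arguments.
  def apply(from: String, to: String, amount: Long, timestamp: Long): TransactionItem = {
    val id = sha256(s"$from$to$amount$timestamp".getBytes("UTF-8"))
    TransactionItem(id, from, to, amount, timestamp)
  }
}

// A collection of items; its id is just a random UUID (content integrity
// is left to the Merkle tree).
case class Transactions(id: String, items: List[TransactionItem])

object Transactions {
  def apply(items: List[TransactionItem]): Transactions =
    Transactions(UUID.randomUUID().toString, items)
}
```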

For illustration, we create a method that instantiates a Transactions object consisting of a random number of TransactionItems, each with a pair of random Accounts and a random amount (remember each Account needs a public key as its id).

Generating a couple of Transactions objects with random content:

Hashing transactions into a Merkle tree

Merkle trees (or hash trees) are commonly used in blockchain computing. The main purpose of using a hash tree is to guarantee the authenticity of the contents in the dataset by successively composing their hashes in a hard-to-tamper fashion while keeping the resulting data structure relatively lightweight.

In a previous blog post on Merkle tree implementation in Scala, we saw how a hash tree can be created from a collection of datasets. Even though the transaction collection data structure is now slightly more complex than the trivial example in the previous post, there is no added complexity in creating the hash tree. Borrowing a slight variant of the Merkle tree class from that post:

Using MerkleTree’s apply factory method, we simply supply transactions objects we’ve created as the method argument:

For transaction collection trans1, Merkle root mRoot1 is all that is needed to ensure its integrity. So is mRoot2 for trans2. For a given collection of transactions, the recursive hashing of the transaction items in the tree nodes all the way to the root node makes it mathematically difficult to tamper with the transaction content. The Merkle root along with the associated transaction collection will be kept in an immutable “block”.

While the term “blockchain” has been used ad libitum throughout the post, we have not seen anything remotely resembling a “block” yet, have we? So far, we’ve only put in place some simple data structures along with a few utility/crypto functions. Nonetheless, they’re the essential elements for the building “block” of a blockchain, which we’ll dig into in the next post of this blog series.


Blockchain Mining And Proof-of-Work

This is the second part of the Blockchain mini-series. Core focus of this post is to define the building “block” of a blockchain and illustrate how to “mine” a block in Scala.

First, let’s make available the key entity classes (Account, Transactions, MerkleTree, etc) and utility/crypto functions we assembled in the previous post.

The building “block” of a Blockchain

We now define the building “block” of a blockchain. In our simple model, it’s a “reversed” linked list with each node consisting of the hash value of its preceding block. Key fields in the Block class include the hash values of the current and preceding blocks, the transaction collection and the corresponding Merkle root, block creation timestamp and a couple of fields for Proof of Work (PoW) – a consensus algorithm we’ll get into shortly.

Let’s define object ProofOfWork, which includes just a couple of values relevant to the PoW process, as follows for now:
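The original object isn't shown; an initial sketch holding just the network defaults (the specific values here are illustrative assumptions, not from the post) could be:

```scala
// Initial sketch of the ProofOfWork object: just the network defaults for now.
object ProofOfWork {
  val defaultDifficulty: Int = 3  // required number of leading zeros in the hash
  val defaultNonce: Long = 0L     // starting nonce every miner begins from
}
```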

RootBlock is the primordial block (a.k.a. genesis block), and any subsequent block chained after it is a LinkedBlock, which has an additional field blockPrev, a “pointer” to the preceding block. To keep function signatures simple, the hash function hashFcn for the Block subclasses and other previously defined classes is not passed in as a constructor or method argument but is predefined for all of them.
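The actual class hierarchy isn't reproduced here; a sketch consistent with the description (field names and the genesis-block defaults are assumptions) might look like:

```scala
// Sketch of the block hierarchy described above.
sealed trait Block {
  def hash: Array[Byte]        // hash identifying this block
  def merkleRoot: Array[Byte]  // root hash of the enclosed transactions
  def timestamp: Long
  def difficulty: Int          // Proof-of-Work fields
  def nonce: Long
}

// The primordial (genesis) block; its field values are illustrative.
case object RootBlock extends Block {
  val hash: Array[Byte] = Array.fill[Byte](32)(0)
  val merkleRoot: Array[Byte] = Array.emptyByteArray
  val timestamp: Long = 0L
  val difficulty: Int = 0
  val nonce: Long = 0L
}

// A block chained after another, with blockPrev "pointing" to its predecessor.
case class LinkedBlock(
  hash: Array[Byte],
  hashPrev: Array[Byte],
  blockPrev: Block,
  merkleRoot: Array[Byte],
  timestamp: Long,
  difficulty: Int,
  nonce: Long
) extends Block
```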

Proof of Work – a computationally demanding task

Proof of Work (PoW) is a kind of consensus algorithm, arguably popularized by Bitcoin. Due to the lack of a centralized authoritative source, such as a central bank for a conventional currency, a decentralized cryptocurrency network needs a consensus protocol that every participant agrees to.

The idea of PoW is to incentivize participants to work on a computationally demanding task in order to be qualified for adding a new block of unspent transactions to the existing blockchain – the distributed ledger. The incentive is a reward, typically a specific amount of the cryptocurrency, the network offers to the participants who:

  1. completed the task, and,
  2. successfully added to the blockchain a new block consisting of the task completion proof.

These participants are referred to as the miners. The copy of the blockchain maintained by the competing miner with the highest PoW value (generally measured by the length of the blockchain) overrides the rest.

A commonly employed PoW scheme is to require a miner to repeatedly apply a hash function to a given string of text combined with an incrementing integer until the resulting hash value has a certain number of leading zeros. Mathematically this is a task requiring exponentially more trials for every additional zero in the requirement.

We now expand object ProofOfWork to include all the key elements for PoW.

Borrowing some terms from the Bitcoin model, defaultDifficulty is the default difficulty level of the PoW task. The level value represents the number of leading zeros required to be in the resulting hash value after repeatedly applying a hash function against the concatenation of the hash of a block and a monotonically increasing integer. The incremental integer is called nonce, and defaultNonce is the network-default value. Note that, by design, the output of hashing is so unpredictable and sensitive to input change that any given starting nonce value will not have any advantage over any other values.

Method validProof concatenates a dataset of type byte-array with a byte-converted nonce, applies hashFcn to the combined data and reports whether the resulting hash value has the number of leading zeros specified by difficulty. Method generateProof is a simple recursive snippet that takes a Base64 string and runs validProof repeatedly with a monotonically incrementing nonce until it satisfies the difficulty requirement.
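The original methods aren't reproduced here; a self-contained sketch of the pair (leading zeros counted in hex characters here, which is an assumption; the post may count bits or bytes) could be:

```scala
import java.security.MessageDigest

val hashFcn: Array[Byte] => Array[Byte] =
  bytes => MessageDigest.getInstance("SHA-256").digest(bytes)

def toHex(bytes: Array[Byte]): String = bytes.map("%02x".format(_)).mkString

// True when hashing (data ++ nonce-bytes) yields `difficulty` leading zeros.
def validProof(data: Array[Byte], nonce: Long, difficulty: Int): Boolean =
  toHex(hashFcn(data ++ BigInt(nonce).toByteArray)).startsWith("0" * difficulty)

// Increment the nonce until the difficulty requirement is met.
@annotation.tailrec
def generateProof(data: Array[Byte], nonce: Long, difficulty: Int): Long =
  if (validProof(data, nonce, difficulty)) nonce
  else generateProof(data, nonce + 1, difficulty)
```

Note that every additional required zero multiplies the expected number of trials (by 16 with hex-character counting), which is what makes the difficulty level an effective throttle.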

Generating Proof

Through repeated trials of hashing with the incrementing nonce value, the “proof” is the first incremented nonce value that satisfies the requirement at the set difficulty level. As it could take significant time to arrive at the required hash value, it’s logical to perform PoW asynchronously.

Using the scheduleOnce Akka scheduler to complete a Promise with a TimeoutException after a certain elapsed time, we’ve created an asynchronous method with a non-blocking timeout mechanism (as opposed to Await) that returns a Future of the wanted nonce value.
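The Akka-based method isn't reproduced here; as a dependency-free approximation of the same pattern, the sketch below completes a Promise with a TimeoutException via a plain JDK scheduler (standing in for Akka's scheduleOnce) while the mining loop runs on a Future:

```scala
import java.security.MessageDigest
import java.util.concurrent.{Executors, TimeUnit, TimeoutException}
import scala.concurrent.{ExecutionContext, Future, Promise}

implicit val ec: ExecutionContext = ExecutionContext.global
val timer = Executors.newSingleThreadScheduledExecutor()

val hashFcn: Array[Byte] => Array[Byte] =
  bytes => MessageDigest.getInstance("SHA-256").digest(bytes)

// validProof re-declared so the snippet stands alone (hex-character counting assumed).
def validProof(data: Array[Byte], nonce: Long, difficulty: Int): Boolean =
  hashFcn(data ++ BigInt(nonce).toByteArray)
    .map("%02x".format(_)).mkString.startsWith("0" * difficulty)

// Non-blocking PoW with a timeout: whichever completes the Promise first wins.
def generateProofAsync(data: Array[Byte], difficulty: Int, timeoutMillis: Long): Future[Long] = {
  val promise = Promise[Long]()
  timer.schedule(new Runnable {
    def run(): Unit =
      promise.tryFailure(new TimeoutException(s"PoW timed out after ${timeoutMillis}ms"))
  }, timeoutMillis, TimeUnit.MILLISECONDS)
  Future {
    var nonce = 0L
    // Abort the loop early if the timeout has already completed the Promise.
    while (!promise.isCompleted && !validProof(data, nonce, difficulty)) nonce += 1
    nonce
  }.foreach(n => promise.trySuccess(n))
  promise.future
}
```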

Test running generatePoW

To test out generating Proof of Work, we borrow the random account/transaction creation snippet from the previous blog:

Generating a few Transactions objects with random content, each to be used in building a block:

The miner will use their own account to receive the blockchain reward upon successful block acceptance. Creating a new block requires a couple of things:

  • the reference to the last block in the existing blockchain
  • a collection of unspent transactions from a distributed queue

The miner will then save the last block reference as blockPrev in the new block, prepend to the transaction collection an additional transaction with their own account reference for the reward, and start the PoW process. Upon finishing the PoW, the incremented nonce that satisfies the requirement at the predefined difficulty level will be kept in the new block as the proof for validation by the blockchain system. Below is the method mine that creates the new block.

Let’s just say the miner owns the first of the previously created accounts:

We’ll start mining a new block to be appended to the root block (i.e. the genesis block RootBlock).

Now that the last block of the blockchain is block1, a subsequent mining attempt will make it the preceding block, and further mining proceeds successively in the same fashion.

A couple of notes:

  1. At difficulty level 3, the 20-second timeout is generally sufficient on a computer with, say, a 3GHz 4-core CPU, although one may still get a timeout once in a while. If we elevate it to level 4, the required proof-generating time will be about 100-fold (i.e. 30+ minutes) what it takes for level 3.
  2. The incremented nonce saved in the returned block is the exact number of hashing trials to have finally satisfied the PoW requirement at the predefined difficulty level. At difficulty level 3, the number of trials ranges from millions to tens of millions. A level up would require billions of trials.

To look at what the final blockchain (i.e. block3) is like, let’s create a simple function that converts the linked blocks to elements of a List:
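The original function isn't shown; a self-contained sketch that walks the "reversed" linked list from the newest block back to the genesis block (minimal block types re-declared; names assumed) could be:

```scala
// Minimal block types re-declared so the snippet stands alone.
sealed trait Block
case object RootBlock extends Block
case class LinkedBlock(id: String, blockPrev: Block) extends Block

// Walk from the newest block back to the genesis block, newest first.
@annotation.tailrec
def toList(block: Block, acc: List[Block] = Nil): List[Block] = block match {
  case lb: LinkedBlock => toList(lb.blockPrev, acc :+ lb)
  case RootBlock       => acc :+ RootBlock
}
```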

As a side note, the above detailed steps are for illustration purpose. The list of transactions and blocks could’ve been created in a more concise manner:

Proof validation

Validating a proof (i.e. the incremented nonce) saved in a block is simple. Object ProofOfWork includes method validProofIn(block: Block) that takes a block and verifies whether the hash function, applied to the block’s hash and nonce, reproduces a result with the number of leading zeros matching the difficulty level.

For example, the following verification confirms that block3 has a valid proof.

In the next blog post, we’ll port the above proof-of-concept Scala snippets to run on a scalable Akka cluster.


An Akka Actor-based Blockchain

As proposed at the beginning of this blockchain mini blog series, we’ll have an Actor-based blockchain application at the end of the series. The fully functional application is written in Scala along with the Akka toolkit.

While this is part of a blog series, this post could still be viewed as an independent one that illustrates the functional flow of a blockchain application implemented in Scala/Akka.

What is a blockchain?

Summarizing a few key characteristics of a blockchain (primarily from the angle of a cryptocurrency system):

  • At the core of a cryptocurrency system is a distributed ledger with a collection of transactions stored in individual “blocks” each of which is successively chained to another, thus the term “blockchain”.
  • There is no centralized database storing the ledger as the authoritative data source. Instead, each of the decentralized “nodes” maintains its own copy of the blockchain that gets updated in a consensual fashion.
  • At the heart of the so-called “mining” process lies a “consensus” algorithm that determines how participants can earn the mining reward as an incentive for them to collaboratively grow the blockchain.
  • One of the most popular consensus algorithms is Proof of Work (PoW), which is a computationally demanding task for the “miners” to compete for a reward (i.e. a certain amount of digital coins) offered by the system upon successfully adding a new block to the existing blockchain.
  • In a cryptocurrency system like Bitcoin, the blockchain that has the highest PoW value (generally measured by the length, or technically referred to as height) of the blockchain overrides the rest.

Beyond cryptocurrency

While blockchain is commonly associated with cryptocurrency, the term has been generalized to a computing class (namely blockchain computing) covering a wide range of use cases, such as supply chain management and asset tokenization. For instance, Ethereum, a prominent cryptocurrency, is also an increasingly popular computing platform for building blockchain-based decentralized applications. Its codebase is primarily in Golang and C++.

Within the Ethereum ecosystem, Truffle (a development environment for decentralized applications) and Solidity (a JavaScript-like scripting language for developing “smart contracts”), among others, have prospered and attracted many programmers from different industry sectors to develop decentralized applications on the platform.

In the Scala world, there is a blockchain framework, Scorex 2.0, that allows one to build blockchain applications not limited to cryptocurrency systems. Supporting multiple kinds of consensus algorithms, it offers a versatile framework for developing custom blockchain applications. Its predecessor, Scorex, is what powers the Waves blockchain. As of this post, though, the framework is still largely in an experimental stage.

How Akka Actors fit into running a blockchain system

A predominant implementation of the Actor model, Akka Actors offer a comprehensive API for building scalable distributed systems such as Internet-of-Things (IoT) systems. It comes as no surprise the toolset also works great for what a blockchain application requires.

Lightweight and loosely-coupled by design, actors can serve as an efficient construct to model the behaviors of the blockchain mining activities and autonomously maintain the internal state (e.g. the blockchain instance within a given actor). In addition, the non-blocking interactions among actors via message passing (i.e. the fire-and-forget tell method or query-alike ask method) allow individual modules to effectively interact with custom logic flow and share states, as needed, with each other. The versatile interaction functionality makes actors useful for building various kinds of modules from highly interactive routines such as simulation of transaction bookkeeping to request/response queries like blockchain validation.

On distributed cluster functionality, Akka provides a suite of cluster features: cluster-wide routing, distributed data replication, cluster sharding, distributed publish/subscribe, etc. There are different approaches to maintaining the decentralized blockchains on individual cluster nodes. For multiple independent mining processes to consensually share their latest blockchains with each other in a decentralized fashion, Akka’s distributed pub/sub proves to be a superb tool.

A blockchain application that mimics a simplified cryptocurrency system

UPDATE: A new version of this application that uses Akka Typed actors (as opposed to the Akka classic actors) is available. An overview of the new application is at this blog post. Also available is a mini blog series that describes the basic how-to’s for migrating from Akka classic to Akka Typed.

It should be noted that Akka has been moving toward typed actors since release 2.5, although both classic and typed actors are supported in the current 2.6 release. While the Akka Typed API, which enforces type-safe code, is now a stable release, it’s still relatively new and the API change is rather drastic, requiring experimental effort to ensure everything does what it advertises. Partly because of that, Akka classic actors are used in this blockchain application. Nonetheless, the code should run fine on both Akka 2.5 and 2.6.

The build tool for the application is the good old sbt, with the library dependencies specified in build.sbt, and all the configuration values for the Akka cluster and blockchain specifics, such as the proof-of-work difficulty level, mining reward and time-out settings, in application.conf:
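The actual application.conf isn't reproduced here; an illustrative HOCON fragment, with all keys, values and the system name being assumptions rather than the post's configuration, might look like:

```hocon
akka {
  actor.provider = cluster
  remote.artery {                       # Artery TCP remoting
    transport = tcp
    canonical.hostname = "127.0.0.1"
    canonical.port = 0
  }
  cluster.seed-nodes = [                # two seed nodes on a single host
    "akka://blockchain@127.0.0.1:2551",
    "akka://blockchain@127.0.0.1:2552"
  ]
}
blockchain {
  proof-of-work.difficulty = 3          # leading zeros required
  mining-reward = 100
  timeout.mining = 20s
}
```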

Note that Artery TCP remoting is used, as opposed to the classic Netty-based remoting.

With the default configuration, the application will launch an Akka cluster on a single host with two seed nodes at ports 2551 and 2552 for additional nodes to join the cluster. Each user can participate in the network with their cryptographic public key (for collecting the mining reward) provided as an argument for the main program on one of the cluster nodes to perform simulated mining tasks.

For illustration purpose, the main program will either by default enter a periodic mining loop with configurable timeout, or run a ~1 minute quick test by adding “test” to the program’s argument list.

Functional flow of the blockchain application

Rather than stepping through the application logic in text, the following diagram illustrates the functional flow of the Akka actor-based blockchain application:

Akka Blockchain - functional flow

Below is a summary of the key roles played by the various actors in the application:

Blockchainer – A top-level actor that maintains a distributed copy of the blockchain and transaction queue for a given network participant (e.g. a miner) identified by their cryptographic public key. It collects submitted transactions in the queue and updates the blockchain according to the consensual rules via the cluster-wide distributed pub/sub. Playing a managerial role, the actor delegates mining work to actor Miner and validation of mined blocks to actor BlockInspector.

Miner – A child actor of Blockchainer responsible for processing mining tasks, carrying out computationally demanding Proof of Work using a non-blocking routine and returning the proofs back to the parent actor via the Akka ask pattern.

BlockInspector – Another child actor of Blockchainer for validating the content of a given block, typically a newly mined block. The validation verifies the generated proof and goes “vertically” down the nested data structure (transactions/transactionItems, merkleRoot, etc.) within a block as well as “horizontally” across all the preceding blocks. The result is then returned to the parent actor via Akka ask.

Simulator – A top-level actor that simulates mining requests and transaction submissions sent to actor Blockchainer. It spawns periodic mining requests by successively calling Akka scheduler function scheduleOnce with randomized variants of configurable time intervals. Transaction submissions are delegated to actor TransactionFeeder.

TransactionFeeder – A child actor of actor Simulator responsible for periodically submitting transactions to actor Blockchainer via an Akka scheduler. Transactions are created with random user accounts and transaction amounts. Since accounts are represented by their cryptographic public keys, a number of PKCS#8 PEM keypair files under “{project-root}/src/main/resources/keys/” were created in advance to save initial setup time.

As for the underlying data structures including Account, Transactions, MerkleTree, Block and ProofOfWork, it’s rather trivial to sort out their inter-relationship by skimming through the relevant classes/companion objects in the source code. For details at the code level of 1) how they constitute the “backbone” of the blockchain, and 2) how Proof of Work is carried out in the mining process, please refer to the previous couple of posts of this mini series.

Complete source code of the blockchain application is available at GitHub.

Test running the blockchain application

Below is sample console output with edited annotations from an Akka cluster of two nodes, each running the blockchain application with the default configuration on its own JVM.

Note that, for illustration purposes, each block as defined in trait Block‘s toString method:

is represented in an abbreviated format as:

where proof is the first incremented nonce value in PoW that satisfies the requirement at the specified difficulty level.

As can be seen in the latest copies of blockchain maintained on the individual cluster nodes, they get updated via distributed pub/sub in accordance with the consensual rule, but still may differ from each other (typically by one or more most recently added blocks) when examined at any given point of time.

Reliability and efficiency

The application is primarily for proof of concept, hence the abundant side-effecting console logging for illustration purposes. From a reliability and efficiency perspective, it would benefit from the following enhancements:

  • Fault tolerance: Akka Persistence via journals and snapshots over Redis, Cassandra, etc., would help recover an actor’s state in case of a system crash. In particular, the distributed blockchain copy (and perhaps transactionQueue as well) maintained within actor Blockchainer could be crash-proofed with persistence. One approach would be to refactor actor Blockchainer to delegate maintenance of blockchain to a dedicated child PersistentActor.
  • Serialization: Akka’s default Java serializer is known for not being very efficient. Other serializers such as Protocol Buffers or Kryo should be considered.

Feature enhancement

Feature-wise, the following enhancements would help make the application one step closer to a real-world cryptocurrency system:

  • Data privacy: Currently the transactions stored in the blockchain aren’t encrypted, even though PKCS public keys are used within individual transactions. The individual transaction items could be encrypted and stored with the associated cryptographic public key/signature, requiring miners to validate the signature while allowing only those who hold the private key for certain transactions to see the content.
  • Self regulation: A self-regulatory mechanism that adjusts the difficulty level of the Proof of Work in accordance with network load would help stabilize the currency. As an example, during a drastic plunge of the Bitcoin market value in mid March, there was reportedly a significant self-regulatory reduction in the PoW difficulty that temporarily made mining more rewarding and helped dampen the fall.
  • Currency supply: In a cryptocurrency like Bitcoin, issuance of the mining reward by the network is essentially the “minting” of the digital coins. To keep the inflation rate under control as the currency supply grows, the rate of coin minting must be proportionately regulated over time. For instance, Bitcoin has a periodic “halving” mechanism that reduces the mining reward by half for every 210,000 blocks added to the blockchain and will cease producing new coins once the total supply reaches 21 million coins.
  • Blockchain versioning: Versioning of the blockchain would make future algorithmic changes possible by means of a fork, akin to Bitcoin’s soft/hard forks, without having to discard the old system.
  • User Interface: The existing application focuses mainly on how to operate a blockchain network, thus supplementing it with, say, a Web-based user interface (e.g. using Play framework) would certainly make it a more complete system.



Orthogonal Typeclass In Scala

As an addendum to a previous blog post on the topic of ad-hoc polymorphism in Scala, I’m adding another common typeclass pattern as a separate post. The term “orthogonal” refers to a pattern that selected class attributes are taken out from the base class to form an independent typeclass.

Using an ADT similar to the Car/Sedan/SUV example used in that previous post, we first define trait Car as follows:

Unlike how the base trait was set up as a typeclass in the ad-hoc polymorphism example, trait Car is now an ordinary trait. But the more significant difference is that method setPrice() is no longer in the base class. It’s being constructed “orthogonally” in a designated typeclass:

Similar to how implicit conversions are set up for ad-hoc polymorphism, implicit values are defined within the companion objects for the individual child classes to implement method setPrice() for specific car types.

The specific method implementations are then abstracted into a “unified” method, setNewPrice(), via an implicit constructor argument by passing the Settable typeclass into the CarOps implicit class:

Testing it out:
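Since the post’s code listings are not shown here, below is a minimal self-contained sketch of the whole pattern described above. The field names (make, price) and the sample values are illustrative assumptions; only the names Car, Sedan, SUV, Settable, setPrice, CarOps and setNewPrice follow the post:

```scala
trait Car { def make: String; def price: Double }
case class Sedan(make: String, price: Double) extends Car
case class SUV(make: String, price: Double) extends Car

// The "orthogonal" typeclass: price-setting is factored out of the Car hierarchy
trait Settable[C <: Car] { def setPrice(car: C, newPrice: Double): C }

// Per-type method implementations (here gathered in one place, per the
// "all implementations in one place" variant discussed below)
implicit val sedanSettable: Settable[Sedan] = (c, p) => c.copy(price = p)
implicit val suvSettable: Settable[SUV]     = (c, p) => c.copy(price = p)

// The "unified" method, wired up via an implicit constructor argument
implicit class CarOps[C <: Car](car: C)(implicit s: Settable[C]) {
  def setNewPrice(newPrice: Double): C = s.setPrice(car, newPrice)
}

// Testing it out
val cheaperSedan = Sedan("sedan-1", 25000).setNewPrice(24000)
val cheaperSuv   = SUV("suv-1", 32000).setNewPrice(31000)
```

Note that setNewPrice returns the concrete subtype C, not the base Car, which is a key benefit of taking the typeclass as an implicit argument.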

Putting all method implementations in one place

It’s worth noting that defining the implicit values for the method implementations in the companion objects of the individual classes is just one convenient way. Alternatively, these implicit values could all be defined in one place:

A benefit of putting all method implementations in one place is that new methods can be added without touching the base classes – especially useful in situations where those case classes cannot be altered.

For instance, if color is also an attribute of trait Car and its child case classes, adding a new color setting method will be a trivial exercise by simply adding a setColor() method signature in trait Settable and its specific method implementations as well as setNewColor() within class CarOps.

Orthogonal type collection

Let’s see what a collection of cars looks like:

To refine the inferred List[Product with java.io.Serializable] collection type, we could provide some type hints as shown below:


Scala Unfold

In Scala 2.13, method unfold was added to the standard library without the fanfare it probably deserves. A first glance at the method signature might make one wonder how it could possibly be useful. Admittedly, it’s not intuitive to reason about how to make use of it. Although it’s new in the Scala standard library, a couple of Akka Stream operators with similar method signatures and functionality, like Source.unfold and Source.unfoldAsync, have already been available for a while.

While the method is available for a number of Scala collections, I’ll illustrate its use with the Iterator collection. One reason Iterator is chosen is that its “laziness” allows the method to be used for generating infinite sequences. Here’s method unfold’s signature:

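In Scala 2.13, the Iterator companion defines it as:

```scala
def unfold[A, S](init: S)(f: S => Option[(A, S)]): Iterator[A]
```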
Method fold versus unfold

Looking at a method by the name unfold, one might begin to ponder its correlation to method fold. The contrast between fold and unfold is in some way analogous to that between apply and unapply, except that it’s a little more intuitive to “reverse” the logic from apply to unapply than from fold to unfold.

Let’s take a look at the method signature of fold:
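For Iterator (via IterableOnce) it reads:

```scala
def fold[A1 >: A](z: A1)(op: (A1, A1) => A1): A1
```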

Given a collection (in this case, an Iterator), method fold allows one to iteratively transform the elements of the collection into an aggregated element of similar type (a supertype of the elements to be precise) by means of a binary operator. For example:

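A minimal sketch (the numbers are picked so the sum comes out to the 1055 used below):

```scala
// Fold the numbers 1..10 into a single sum, starting from the initial value 1000
val sum = Iterator(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).fold(1000)(_ + _)
// 1000 + 1 + 2 + ... + 10 = 1055
```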
Reversing method fold

In the above example, the binary operator _ + _, which is a shorthand for (acc, x) => acc + x, iteratively adds a number from a sequence of number, and fold applies the operator against the given Iterator’s content starting with an initial number 1000. It’s in essence doing this:

To interpret the “reverse” logic in a loose fashion, let’s hypothesize a problem with the following requirement:

Given the number 1055 (the “folded” sum), iteratively assemble a monotonically increasing sequence from 1 such that subtracting the cumulative sum of the sequence elements from 1055 remains larger than 1000.

Here’s one way of doing it using unfold:

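One sketch using Iterator.unfold with a tuple state (i, n), starting from the initial state (1, 1055):

```scala
val seq = Iterator.unfold((1, 1055)) { case (i, n) =>
  if (n > 1000) Some((i, (i + 1, n - i)))  // emit i; decrement the remaining sum by i
  else None                                // terminate once the condition fails
}.toList
// List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
```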
How does unfold work?

Recall that Iterator’s unfold has the following method signature:

As can be seen from the signature, starting from a given “folded” initial state value, elements of a yet-to-be-generated sequence are iteratively “unfolded” by means of the function f. In each iteration, the returned tuple of type Option[(A, S)] determines a few things:

  1. the 1st tuple element of type A is the new element to be added to the resulting sequence
  2. the 2nd tuple element of type S is the next state value, revealing how the state is being iteratively mutated
  3. a returned Some((elem, state)) signals a new element being generated whereas a returned None signals the “termination” of the sequence generation operation

In the above example, the state is itself a tuple with initial state value (1, 1055) and next state value (i+1, n-i). The current state (i, n) is then iteratively transformed into an Option of tuple with:

  • the element value incrementing from i to i+1
  • the state value decrementing from n to n-i, which will be iteratively checked against the n > 1000 condition

Modified examples from Akka Stream API doc

Let’s look at a couple of examples modified from the Akka Stream API doc for Source.unfold. The modification is minor but necessary due to differences in the method signatures.

Example 1:

This is a nice “Hello World” example of unfold. Following the above bullet points of how-it-works, #1 and #2 tell us the resulting sequence has a starting element count iteratively decremented by 1 and how-it-works #3 says when count is not larger than 0 (i.e. decremented down to 0) the sequence generation operation stops.
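Adapted to Iterator, the countdown example reads roughly as (the starting count of 5 is an illustrative choice):

```scala
val countdown = Iterator.unfold(5) { count =>
  if (count > 0) Some((count, count - 1))  // emit the current count, decrement the state
  else None                                // stop once the count reaches 0
}.toList
// List(5, 4, 3, 2, 1)
```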

Example 2:

This example showcases a slick way of generating a Fibonacci sequence. Here, we use a tuple as the initial state value, resulting in the operator returning a value with a nested tuple. Tuples are used for the state because each number in a Fibonacci sequence depends on two preceding numbers, Fib(n) = Fib(n-1) + Fib(n-2), hence in composing the sequence content we want to carry over more than one number in every iteration.

Applying the logic of how-it-works #1 and #2, if x and y represent the current and next elements, respectively, generated for the resulting sequence, x + y would be the value of the following element in accordance with the definition of Fibonacci numbers. In essence, the tuple state represents the next two values of elements to be generated. What about how-it-works #3? The absence of the None case in the return value of the binary operator indicates that there is no terminating condition, hence we have an infinite sequence here.
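The Fibonacci logic, adapted to Iterator, might be sketched as (the initial state (0, 1) is the conventional seed):

```scala
val fib = Iterator.unfold((0, 1)) { case (x, y) =>
  Some((x, (y, x + y)))  // no None case: an infinite sequence
}.take(10).toList
// List(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
```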

Another example:

Here’s one more example which illustrates how one could generate a Factorial sequence using unfold.

In this example, we also use a tuple to represent the state, although there is a critical difference between what the tuple elements represent when compared with the previous example. By definition, the next number in a Factorial sequence only depends on the immediately preceding number, Fact(i+1) = Fact(i) * (i+1), thus the first tuple element, n * (i+1), takes care of that, defining what the next element of the resulting sequence will be. But there is also a need to carry over the next index value and that’s what the second tuple element is for. Again, without the None case in the return value of the binary operator, the resulting sequence will be infinite.
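A sketch of the Factorial version, with the state carrying the current factorial value and the next index:

```scala
val factorials = Iterator.unfold((1L, 0)) { case (n, i) =>
  Some((n, (n * (i + 1), i + 1)))  // Fact(i+1) = Fact(i) * (i+1); carry the next index
}.take(6).toList
// List(1, 1, 2, 6, 24, 120)
```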

As a side note, we could also use method iterate that comes with Scala Iterator collection with similar iteration logic like below:
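A sketch of the Iterator.iterate alternative, carrying the same (value, index) state and projecting out the first tuple element:

```scala
val factorialsViaIterate = Iterator.iterate((1L, 0)) { case (n, i) =>
  (n * (i + 1), i + 1)   // same state transition; iterate also emits the initial state
}.map(_._1).take(6).toList
// List(1, 1, 2, 6, 24, 120)
```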



Traversing A Scala Collection

When we have a custom class in the form of an ADT, say, Container[A] and are required to process a collection of the derived objects like List[Container[A]], there might be times we want to flip the collection “inside out” to become a single “container” of collection Container[List[A]], and maybe further transform the inner collection with a function.

For those who are familiar with Scala Futures, the nature of such transformation is analogous to what method Future.sequence does. In case the traversal involves also applying to individual elements with a function, say, f: A => Container[B] to transform the collection into Container[List[B]], it’ll be more like how Future.traverse works.

To illustrate how we can come up with methods sequence and traverse for the collection of our own ADTs, let’s assemble a simple ADT Fillable[A]. Our goal is to create the following two methods:

For simplicity, rather than a generic collection like IterableOnce, we fix the collection type to List.

A simple ADT

It looks a little like a home-made version of Scala Option, but is certainly not very useful yet. Let’s equip it with a companion object and a couple of methods for transforming the element within a Fillable:

With slightly different signatures, methods map and flatMap are now available for transforming the element “contained” within a Fillable.

A couple of quick notes:

  • Fillable[A] is made covariant so that method map/flatMap is able to operate on subtypes of Fillable.
  • Use of self-type annotation isn’t necessary here, but is rather a personal coding style preference.

Testing the ADT:

Sequencing a collection of Fillables

Let’s assemble method sequence which will reside within the companion object. Looking at the signature of the method to be defined:

it seems logical to consider aggregating a List from scratch within a Fillable using Scala fold. However, trying to iteratively aggregate a list out of elements from within their individual “containers” isn’t as trivial as it may seem. Had there been methods like get/getOrElse that unwrap a Fillable to obtain the contained element, it would’ve been straightforward – although an implementation leveraging a getter method would require a default value for the contained element to handle the Emptied case.

One approach to implement sequence using only map/flatMap would be to first map within the fold operation each Fillable element of the input List into a list-push function for the element’s contained value, followed by a flatMap to aggregate the resulting List by iteratively applying the list-push functions within the Fillable container:

Note that pushToList within map is now regarded as a function that takes an element of type A and returns a List[A] => List[A] function. The expression fa.map(pushToList).flatMap(acc.map) is just a short-hand for:

In essence, the first map transforms element within each Fillable in the input list into a corresponding list-push function for the specific element, and the flatMap uses the individual list-push functions for the inner map to iteratively aggregate the list inside the resulting Fillable.

Traversing the Fillable collection

Next, we’re going to define method traverse with the following signature within the companion object:

In case it doesn’t seem obvious from the method signatures, sequence is really just a special case of traverse with f(a: A) = Fillable(a).

Similar to the way sequence is implemented, we’ll also use fold for iteratively aggregating the resulting list. Since an element of type Fillable[A], when flatMap-ed with the provided function f, would yield a Fillable[B], we’re essentially dealing with the same problem we did for sequence except that type A is now replaced with type B.

Putting everything together:

Testing with the newly created methods:
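Since the post’s listings are not shown here, below is a condensed sketch of the whole Fillable ADT as described above. The subclass names Filled/Emptied are assumptions; the sequence body mirrors the fa.map(pushToList).flatMap(acc.map) expression quoted earlier, with the fold spelled out as a foldRight to preserve element order:

```scala
sealed trait Fillable[+A] { self =>
  def map[B](f: A => B): Fillable[B] = self match {
    case Filled(a) => Filled(f(a))
    case Emptied   => Emptied
  }
  def flatMap[B](f: A => Fillable[B]): Fillable[B] = self match {
    case Filled(a) => f(a)
    case Emptied   => Emptied
  }
}
case class Filled[+A](a: A) extends Fillable[A]
case object Emptied extends Fillable[Nothing]

object Fillable {
  def apply[A](a: A): Fillable[A] = Filled(a)

  def sequence[A](la: List[Fillable[A]]): Fillable[List[A]] =
    la.foldRight(Fillable(List.empty[A])) { (fa, acc) =>
      // map the element into a list-push function, then apply it inside the container
      fa.map((a: A) => (l: List[A]) => a :: l).flatMap(push => acc.map(push))
    }

  def traverse[A, B](la: List[A])(f: A => Fillable[B]): Fillable[List[B]] =
    la.foldRight(Fillable(List.empty[B])) { (a, acc) =>
      f(a).map((b: B) => (l: List[B]) => b :: l).flatMap(push => acc.map(push))
    }
}

val allFilled = Fillable.sequence(List(Fillable(1), Fillable(2), Fillable(3)))
val oneEmpty  = Fillable.sequence(List(Fillable(1), Emptied, Fillable(3)))
```

A single Emptied in the input collapses the whole result to Emptied, matching the semantics of Future.sequence failing when any member fails.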


Implementing Linked List In Scala

In Scala, if you wonder why the standard library doesn’t come with a data structure called LinkedList, you may have overlooked something: the collection List is in fact a linked list – although it often appears in the form of a Seq or Vector collection rather than the generally “mysterious” linked list that exposes its “head” with a hidden “tail” to be revealed only iteratively.

Our ADT: LinkedNode

Perhaps because of its simplicity and dynamicity as a data structure, implementation of a linked list remains a popular coding exercise. To implement our own linked list, let’s start with a barebone ADT (algebraic data type) as follows:

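The barebone ADT might look like the following sketch (the names Node/EmptyNode follow the post; the exact field names are assumptions):

```scala
sealed trait LinkedNode[+A]
// A node holds an element and a reference to the rest of the list
case class Node[+A](elem: A, next: LinkedNode[A]) extends LinkedNode[A]
// The terminator, analogous to Nil in Scala's List
case object EmptyNode extends LinkedNode[Nothing]

val nodes = Node(1, Node(2, EmptyNode))
```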
If you’re familiar with Scala List, you’ll probably notice that our ADT resembles List and its subclasses Cons (i.e. ::) and Nil (see source code):

Expanding LinkedNode

Let’s expand trait LinkedNode to create class methods insertNode/deleteNode at a given index for inserting/deleting a node, toList for extracting contained elements into a List collection, and toString for display:

Note that LinkedNode is made covariant. In addition, method insertNode has type A as its lower type bound because Function1 is contravariant over its parameter type.

Recursion and pattern matching

A couple of notes on the approach used to implement the class methods:

  1. We use recursive functions to avoid using mutable variables. They should be made tail-recursive for optimal performance, but that isn’t the focus of this implementation. If performance is a priority, using conventional while-loops with mutable class fields elem/next would be a more practical option.
  2. Pattern matching is routinely used for handling cases of Node versus EmptyNode. An alternative approach would be to define fields elem and next in the base trait and implement class methods accordingly within Node and EmptyNode.

Finding first/last matching nodes

Next, we add a couple of class methods for finding first/last matching nodes.

Reversing a linked list by groups of nodes

Reversing a given LinkedNode can be accomplished via recursion by cumulatively wrapping the element of each Node in a new Node with its next pointer set to the Node created in the previous iteration.

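A standalone sketch of that recursive reversal (the ADT is repeated here to keep the example self-contained; reverse is written as a free function and made tail-recursive, a detail the post leaves open):

```scala
sealed trait LinkedNode[+A]
case class Node[+A](elem: A, next: LinkedNode[A]) extends LinkedNode[A]
case object EmptyNode extends LinkedNode[Nothing]

def reverse[A](ln: LinkedNode[A]): LinkedNode[A] = {
  @annotation.tailrec
  def loop(rem: LinkedNode[A], acc: LinkedNode[A]): LinkedNode[A] = rem match {
    // wrap the current element in a new Node whose next points to the accumulator
    case Node(e, next) => loop(next, Node(e, acc))
    case EmptyNode     => acc
  }
  loop(ln, EmptyNode)
}

val reversed = reverse(Node(1, Node(2, Node(3, EmptyNode))))
// Node(3, Node(2, Node(1, EmptyNode)))
```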
Method reverseK reverses a LinkedNode by groups of k elements using a different approach that extracts the elements into groups of k elements, reverses the elements in each of the groups and re-wraps each of the flattened elements in a Node.

Using LinkedNode as a Stack

For LinkedNode to serve as a Stack, we include simple methods push and pop as follows:
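One possible shape for those two methods is sketched below (pop returning an Option of the popped element and the remaining stack is an assumption, not necessarily the post’s exact signature):

```scala
sealed trait LinkedNode[+A] {
  // push prepends: the new element becomes the head of the list
  def push[B >: A](b: B): LinkedNode[B] = Node(b, this)
  // pop returns the head element together with the rest, or None when empty
  def pop: Option[(A, LinkedNode[A])] = this match {
    case Node(e, next) => Some((e, next))
    case EmptyNode     => None
  }
}
case class Node[+A](elem: A, next: LinkedNode[A]) extends LinkedNode[A]
case object EmptyNode extends LinkedNode[Nothing]

val stack  = EmptyNode.push(1).push(2)
val popped = stack.pop  // the most recently pushed element comes off first
```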

Addendum: implementing map, flatMap, fold

From a different perspective, if we view LinkedNode as a collection like a Scala List or Vector, we might be craving methods like map, flatMap and fold. Using the same approach of recursion along with pattern matching, it’s rather straightforward to crank them out.

Putting everything together

Along with a few additional simple class methods and a factory method wrapped in LinkedNode‘s companion object, below is the final LinkedNode ADT that includes everything described above.

A cursory test-run …



Akka Stream Stateful MapConcat

If you’ve been building applications with Akka Stream in Scala, you would probably have used mapConcat (and perhaps flatMapConcat as well). It’s a handy method for expanding and flattening content of a Stream, much like how flatMap operates on an ordinary Scala collection. The method has the following signature:

Here’s a trivial example using mapConcat:

A mapConcat with an internal state

A relatively less popular method, one that allows expanding and flattening Stream elements while iteratively processing some internal state, is statefulMapConcat, with the following method signature:

Interestingly, method mapConcat is just a parametrically restricted version of method statefulMapConcat. Here’s how mapConcat[T] is implemented in Akka Stream Flow:
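Modulo version differences, the implementation is essentially a statefulMapConcat whose state factory ignores any state:

```scala
def mapConcat[T](f: Out => immutable.Iterable[T]): Repr[T] =
  statefulMapConcat(() => f)
```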

Example 1: Extracting sections of elements

Let’s look at a simple example that illustrates how statefulMapConcat can be used to extract sections of a given Source in accordance with special elements designated for section-start / stop.

The internal state in the above example is the mutable Boolean variable discard being toggled in accordance with the designated start/stop element to either return an empty Iterable (in this case, Nil) or an Iterable consisting the element in a given iteration.
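The stateful function fed to statefulMapConcat might look like the sketch below. Since it is just a () => In => Iterable[Out] factory, its logic can be exercised without an ActorSystem by applying one generated function across a plain List (the marker strings "&lt;start&gt;"/"&lt;stop&gt;" and the function name are illustrative):

```scala
// Factory function: statefulMapConcat invokes this once per stream
// materialization, giving each run a fresh `discard` flag
def sectionFilter(start: String, stop: String): () => String => List[String] =
  () => {
    var discard = true  // the internal state toggled by the start/stop markers
    elem =>
      if (elem == start) { discard = false; Nil }     // section begins; drop the marker
      else if (elem == stop) { discard = true; Nil }  // section ends; drop the marker
      else if (discard) Nil                           // outside any section
      else List(elem)                                 // inside a section: pass through
  }

val f   = sectionFilter("<start>", "<stop>")()
val out = List("a", "<start>", "b", "c", "<stop>", "d").flatMap(f)
// List(b, c)
```

In an actual stream, the same function factory would be passed directly, e.g. source.statefulMapConcat(sectionFilter("&lt;start&gt;", "&lt;stop&gt;")).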

Example 2: Conditional element-wise pairing of streams

Next, we look at a slightly more complex example. Say, we have two Sources of integer elements and we would like to pair up the elements from the two Sources based on some condition provided as a (Int, Int) => Boolean function.

In the main method ConditionalZip, a couple of Lists are maintained for the two Stream Sources to keep track of elements held off in previous iterations to be conditionally consumed in subsequent iterations based on the provided condition function.

Utility method popFirstMatch is for extracting the first element in a List that satisfies the condition derived from the condition function. It also returns the resulting List consisting of the remaining elements.
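The post's code isn't reproduced here, but based on that description popFirstMatch might look like the following (the name and shape come from the description above; the Int specialization is an assumption):

```scala
// Remove the first element satisfying `p`; return it (if any) along with
// the list of remaining elements in their original order.
def popFirstMatch(l: List[Int], p: Int => Boolean): (Option[Int], List[Int]) =
  l.span(e => !p(e)) match {
    case (prefix, found :: rest) => (Some(found), prefix ::: rest)
    case (_, Nil)                => (None, l)
  }

println(popFirstMatch(List(3, 8, 5, 9), _ > 4))  // (Some(8),List(3, 5, 9))
println(popFirstMatch(List(1, 2), _ > 9))        // (None,List(1, 2))
```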

Note that the filler elements are for method zipAll (available on Akka Stream 2.6+) to cover all elements in the “bigger” Stream Source of the two. The provided filler value should be distinguishable from the Stream elements (Int.MinValue in this example) so that the condition logic can be applied accordingly.
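A quick illustration of zipAll's filler behavior in isolation (a sketch; the filler value here is arbitrary):

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl._

implicit val system = ActorSystem("demo")

// Once the shorter source completes, its slots are filled with the
// provided default so the longer source's elements are still covered.
Source(1 to 4)
  .zipAll(Source(List(10, 20)), Int.MinValue, Int.MinValue)
  .runForeach(println)
// (1,10), (2,20), then (3, ...) and (4, ...) paired with the filler value
```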

Test running ConditionalZip:


Merging Akka Streams With MergeLatest

Akka Stream comes with a comprehensive set of fan-in/fan-out features for stream processing. It’s worth noting that fan-in/fan-out operations take regular streams as input and generate regular streams as output, rather than substreams. That makes them different from substreaming, in which operators like groupBy produce nested SubSource or SubFlow instances that can, in turn, be merged back into a regular stream via functions like mergeSubstreams.

Fan-in: Zip versus Merge

Fan-in functionalities primarily belong to two types of operations: Zip and Merge. One of the main differences between them is that Zip may combine streams of different element types to generate a stream of tuple-typed elements, whereas Merge takes streams of the same type and generates a stream of elements (or a stream of collections of elements). Another difference is that with Zip, the resulting stream emits only when every input stream has an element available, whereas Merge emits as soon as any one of the input streams has an element.
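The contrast can be seen with a tiny sketch (the interleaving order for merge is nondeterministic):

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl._

implicit val system = ActorSystem("demo")

// Zip pairs elements positionally and may mix element types;
// it completes when the shorter input completes.
Source(1 to 3).zip(Source(List("a", "b"))).runForeach(println)
// (1,a), (2,b)

// Merge requires a common element type and emits each element
// as soon as any input has one available.
Source(1 to 3).merge(Source(4 to 6)).runForeach(println)
// some interleaving of 1, 2, 3, 4, 5, 6
```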

Starting v2.6, Akka Stream introduces a few additional flavors of Merge functions such as mergeLatest, mergePreferred, mergePrioritized. In this blog post, we’re going to focus on Merge, in particular, mergeLatest which, unlike most other Merge functions, generates a list of elements for each element emitted from any of the input streams.

MergeLatest

Function mergeLatest takes a couple of parameters: inputPorts, which is the number of input streams, and eagerClose, which specifies whether the stream completes when all upstreams complete (false) or when one upstream completes (true).

Let’s try it out using Source.combine, which takes two or more Sources and applies the provided uniform fan-in operator (in this case, MergeLatest):

import akka.stream.scaladsl._
import akka.actor.ActorSystem
import scala.concurrent.duration._

implicit val system = ActorSystem("system")

val s1 = Source(1 to 3)
val s2 = Source(11 to 13).throttle(1, 50.millis)
val s3 = Source(101 to 103).throttle(1, 100.millis)

// Source.combine(s1, s2, s3)(Merge[Int](_)).runForeach(println)  // Ordinary Merge
Source.combine(s1, s2, s3)(MergeLatest[Int](_)).runForeach(println)

// Output: 
//
// List(1, 11, 101)
// List(2, 11, 101)
// List(2, 12, 101)
// List(3, 12, 101)
// List(3, 13, 101)
// List(3, 13, 102)
// List(3, 13, 103)

For comparison, had MergeLatest been replaced with the ordinary Merge, the output would be like this:

// Output:
//
// 1
// 11
// 101
// 2
// 12
// 3
// 13
// 102
// 103

As can be seen from Akka Stream’s Flow source code, mergeLatest uses the stream processing operator MergeLatest for the special case of 2 input streams:

def mergeLatest[U >: Out, M](that: Graph[SourceShape[U], M], eagerComplete: Boolean = false): Repr[immutable.Seq[U]] =
  via(mergeLatestGraph(that, eagerComplete))

protected def mergeLatestGraph[U >: Out, M](
    that: Graph[SourceShape[U], M],
      eagerComplete: Boolean): Graph[FlowShape[Out @uncheckedVariance, immutable.Seq[U]], M] =
  GraphDSL.create(that) { implicit b => r =>
    val merge = b.add(MergeLatest[U](2, eagerComplete))
    r ~> merge.in(1)
    FlowShape(merge.in(0), merge.out)
  }

And below is how the MergeLatest operator is implemented:

object MergeLatest {
  def apply[T](inputPorts: Int, eagerComplete: Boolean = false): GraphStage[UniformFanInShape[T, List[T]]] =
    new MergeLatest[T, List[T]](inputPorts, eagerComplete)(_.toList)
}

final class MergeLatest[T, M](val inputPorts: Int, val eagerClose: Boolean)(buildElem: Array[T] => M)
    extends GraphStage[UniformFanInShape[T, M]] {
  require(inputPorts >= 1, "input ports must be >= 1")

  val in: immutable.IndexedSeq[Inlet[T]] = Vector.tabulate(inputPorts)(i => Inlet[T]("MergeLatest.in" + i))
  val out: Outlet[M] = Outlet[M]("MergeLatest.out")
  override val shape: UniformFanInShape[T, M] = UniformFanInShape(out, in: _*)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) with OutHandler {
      private val activeStreams: java.util.HashSet[Int] = new java.util.HashSet[Int]()
      private var runningUpstreams: Int = inputPorts
      private def upstreamsClosed: Boolean = runningUpstreams == 0
      private def allMessagesReady: Boolean = activeStreams.size == inputPorts
      private val messages: Array[Any] = new Array[Any](inputPorts)

      override def preStart(): Unit = in.foreach(tryPull)

      in.zipWithIndex.foreach {
        case (input, index) =>
          setHandler(
            input,
            new InHandler {
              override def onPush(): Unit = {
                messages.update(index, grab(input))
                activeStreams.add(index)
                if (allMessagesReady) emit(out, buildElem(messages.asInstanceOf[Array[T]]))
                tryPull(input)
              }

              override def onUpstreamFinish(): Unit = {
                if (!eagerClose) {
                  runningUpstreams -= 1
                  if (upstreamsClosed) completeStage()
                } else completeStage()
              }
            })
      }

      override def onPull(): Unit = {
        var i = 0
        while (i < inputPorts) {
          if (!hasBeenPulled(in(i))) tryPull(in(i))
          i += 1
        }
      }

      setHandler(out, this)
    }

  override def toString = "MergeLatest"
}

As shown in the source code, it’s implemented as a standard GraphStage of UniformFanInShape. The implementation is so modular that repurposing it to do something a little differently can be rather easy.

Repurposing MergeLatest

There was a relevant use-case inquiry on Stack Overflow to which I offered a solution for changing the initial stream emission behavior. By design, MergeLatest starts emitting the output stream only after each input stream has emitted an initial element, which is somewhat of an exception to the typical Merge behavior mentioned earlier. The solution I suggested is to revise the operator so that its emission behavior resembles that of the other Merge operators: it starts emitting as soon as one of the input streams has an element, filling in the rest with a user-provided default element.

Below is the repurposed code:

import akka.stream.scaladsl._
import akka.stream.stage.{ GraphStage, GraphStageLogic, InHandler, OutHandler }
import akka.stream.{ Attributes, Inlet, Outlet, UniformFanInShape }
import scala.collection.immutable

object MergeLatestWithDefault {
  def apply[T](inputPorts: Int, default: T, eagerComplete: Boolean = false): GraphStage[UniformFanInShape[T, List[T]]] =
	new MergeLatestWithDefault[T, List[T]](inputPorts, default, eagerComplete)(_.toList)
}

final class MergeLatestWithDefault[T, M](val inputPorts: Int, val default: T, val eagerClose: Boolean)(buildElem: Array[T] => M)
	extends GraphStage[UniformFanInShape[T, M]] {
  require(inputPorts >= 1, "input ports must be >= 1")

  val in: immutable.IndexedSeq[Inlet[T]] = Vector.tabulate(inputPorts)(i => Inlet[T]("MergeLatestWithDefault.in" + i))
  val out: Outlet[M] = Outlet[M]("MergeLatestWithDefault.out")
  override val shape: UniformFanInShape[T, M] = UniformFanInShape(out, in: _*)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
	new GraphStageLogic(shape) with OutHandler {
	  private val activeStreams: java.util.HashSet[Int] = new java.util.HashSet[Int]()
	  private var runningUpstreams: Int = inputPorts
	  private def upstreamsClosed: Boolean = runningUpstreams == 0
	  private val messages: Array[Any] = Array.fill[Any](inputPorts)(default)

	  override def preStart(): Unit = in.foreach(tryPull)

	  in.zipWithIndex.foreach {
		case (input, index) =>
		  setHandler(
			input,
			new InHandler {
			  override def onPush(): Unit = {
				messages.update(index, grab(input))
				activeStreams.add(index)
				emit(out, buildElem(messages.asInstanceOf[Array[T]]))
				tryPull(input)
			  }

			  override def onUpstreamFinish(): Unit = {
				if (!eagerClose) {
				  runningUpstreams -= 1
				  if (upstreamsClosed) completeStage()
				} else completeStage()
			  }
			})
	  }

	  override def onPull(): Unit = {
		var i = 0
		while (i < inputPorts) {
		  if (!hasBeenPulled(in(i))) tryPull(in(i))
		  i += 1
		}
	  }

	  setHandler(out, this)
	}

  override def toString = "MergeLatestWithDefault"
}

Little code change is necessary in this case. Besides an additional parameter for the default element value, which is pre-filled in an internal array, the only change is that the emit call inside the InHandler’s onPush is no longer conditional.

Testing it out:

import akka.stream.scaladsl._
import akka.actor.ActorSystem
import scala.concurrent.duration._

implicit val system = ActorSystem("system")

val s1 = Source(1 to 3)
val s2 = Source(11 to 13).throttle(1, 50.millis)
val s3 = Source(101 to 103).throttle(1, 100.millis)

Source.combine(s1, s2, s3)(MergeLatestWithDefault[Int](_, 0)).runForeach(println)

// Output: 
//
// List(1, 0, 0)
// List(1, 11, 0)
// List(1, 11, 101)
// List(2, 11, 101)
// List(2, 12, 101)
// List(3, 12, 101)
// List(3, 13, 101)
// List(3, 13, 102)
// List(3, 13, 103)


Spark Higher-order Functions

Apache Spark’s DataFrame API provides comprehensive functions for transforming or aggregating data in a row-wise fashion. Like many popular relational database systems such as PostgreSQL, these functions are internally optimized to efficiently process a large number of rows. Better yet, Spark runs on distributed platforms and, if configured to fully utilize the available processing cores and memory, can handle data at a really large scale.

That’s all great, but what about transforming or aggregating data of the same type column-wise? Starting from Spark 2.4, a number of methods for ArrayType (and MapType) columns have been added. But users can still feel hamstrung when none of the available methods can deal with something as simple as, say, summing the integer elements of an array.

User-provided lambda functions

A higher-order function allows one to process a collection of elements (of the same data type) in accordance with a user-provided lambda function to specify how the collection content should be transformed or aggregated. The lambda function being part of the function signature makes it possible to process the collection of elements with relatively complex processing logic.

Coupled with the use of the array method, higher-order functions are particularly useful when transformation or aggregation across a list of columns (of the same data type) is needed. Below are a few such functions:

  • filter()
  • exists()
  • transform()
  • aggregate()

The lambda function can be either a unary or a binary operator. As will be shown in the examples below, function aggregate() requires a binary operator whereas the other functions expect a unary operator.
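For instance, aggregate() folds an array column with a binary lambda. A sketch of computing a per-row order total (untested; it presumes an orders array column of (price, qty) structs assembled with array() in the same way as the exists() example below):

```scala
// aggregate(<array>, <initial value>, <binary merge lambda>):
// start from 0.0 and accumulate price * qty across the order structs.
df.
  withColumn("orders", array(orderCols.map(col): _*)).
  withColumn("total", expr("aggregate(orders, 0d, (acc, x) -> acc + x.price * x.qty)")).
  select("id", "total").
  show(false)
```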

A caveat

Unless you’re on Spark 3.x, higher-order functions aren’t part of Spark 2.4’s built-in DataFrame API. They are expressed in standard SQL syntax along with a lambda function and need to be passed in as a String via expr(). Hence, to use these functions, one would need to temporarily “exit” the Scala world to assemble proper SQL expressions in the SQL arena.

Let’s create a simple DataFrame for illustrating how these higher-order functions work.

import org.apache.spark.sql.functions._  // expr, array, col used in the examples below

case class Order(price: Double, qty: Int)

val df = Seq(
  (101, 10, Order(1.2, 5), Order(1.0, 3), Order(1.5, 4), Seq("strawberry", "currant")),
  (102, 15, Order(1.5, 6), Order(0.8, 5), Order(1.0, 7), Seq("raspberry", "cherry", "blueberry"))
).toDF("id", "discount", "order1", "order2", "order3", "fruits")

df.show(false)
// +---+--------+--------+--------+--------+------------------------------+
// |id |discount|order1  |order2  |order3  |fruits                        |
// +---+--------+--------+--------+--------+------------------------------+
// |101|10      |[1.2, 5]|[1.0, 3]|[1.5, 4]|[strawberry, currant]         |
// |102|15      |[1.5, 6]|[0.8, 5]|[1.0, 7]|[raspberry, cherry, blueberry]|
// +---+--------+--------+--------+--------+------------------------------+

Function filter()

Here’s an “unofficial” method signature of filter():

// Scala-style signature of `filter()`
def filter[T](arrayCol: ArrayType[T], fcn: T => Boolean): ArrayType[T]

The following snippet uses filter to extract any fruit item that ends with “berry”.

df.
  withColumn("berries", expr("filter(fruits, x -> x rlike '.*berry')")).
  select("id", "fruits", "berries").
  show(false)
// +---+------------------------------+----------------------+
// |id |fruits                        |berries               |
// +---+------------------------------+----------------------+
// |101|[strawberry, currant]         |[strawberry]          |
// |102|[raspberry, cherry, blueberry]|[raspberry, blueberry]|
// +---+------------------------------+----------------------+

Function transform()

Method signature (unofficial) of transform():

// Scala-style signature of `transform()`
def transform[T, S](arrayCol: ArrayType[T], fcn: T => S): ArrayType[S]

Here’s an example of using transform() to flag, with an ‘*’, any fruit item not ending with “berry”.

df.withColumn(
    "non-berries",
    expr("transform(fruits, x -> case when x rlike '.*berry' then x else concat(x, '*') end)")
  ).
  select("id", "fruits", "non-berries").
  show(false)
// +---+------------------------------+-------------------------------+
// |id |fruits                        |non-berries                    |
// +---+------------------------------+-------------------------------+
// |101|[strawberry, currant]         |[strawberry, currant*]         |
// |102|[raspberry, cherry, blueberry]|[raspberry, cherry*, blueberry]|
// +---+------------------------------+-------------------------------+

So far, we’ve seen how higher-order functions transform data in an ArrayType collection. For the following examples, we’ll illustrate applying the higher-order functions to individual columns (of same data type) by first turning selected columns into a single ArrayType column.

Let’s assemble an array of the individual columns we would like to process across:

val orderCols = df.columns.filter{
  c => "^order\\d+$".r.findFirstIn(c).nonEmpty
}
// orderCols: Array[String] = Array(order1, order2, order3)

Function exists()

Method signature (unofficial) of exists():

// Scala-style signature of `exists()`
def exists[T](arrayCol: ArrayType[T], fcn: T => Boolean): Boolean

An example using exists() to check whether any of the individual orders per row consists of item price below $1.

df.
  withColumn("orders", array(orderCols.map(col): _*)).
  withColumn("sub$-prices", expr("exists(orders, x -> x.price < 1)")).
  select("id", "orders", "sub$-prices").
  show(false)
// +---+------------------------------+-----------+
// |id |orders                        |sub$-prices|
// +---+------------------------------+-----------+
// |101|[[1.2, 5], [1.0, 3], [1.5, 4]]|false      |
// |102|[[1.5, 6], [0.8, 5], [1.0, 7]]|true       |
// +---+------------------------------+-----------+
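Since exists short-circuits over the array elements the same way Scala's own exists does over a collection, the per-row predicate can be checked without Spark; the Order case class below is a stand-in for the struct elements of the orders column:

```scala
// Stand-in for the `orders` array elements (structs with price and qty).
case class Order(price: Double, qty: Int)

// Plain-Scala analogue of `exists(orders, x -> x.price < 1)`.
def hasSubDollarItem(orders: Seq[Order]): Boolean =
  orders.exists(_.price < 1)

hasSubDollarItem(Seq(Order(1.2, 5), Order(1.0, 3), Order(1.5, 4)))  // false (id 101)
hasSubDollarItem(Seq(Order(1.5, 6), Order(0.8, 5), Order(1.0, 7)))  // true  (id 102)
```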

Function aggregate()

Method signature (unofficial) of aggregate():

// Scala-style signature of `aggregate()`
def aggregate[T, S](arrayCol: ArrayType[T], init: S, fcn: (S, T) => S): S

The example below shows how to compute discounted total of all the orders per row using aggregate().

df.
  withColumn("orders", array(orderCols.map(col): _*)).
  withColumn("total", expr("aggregate(orders, 0d, (acc, x) -> acc + x.price * x.qty)")).
  withColumn("discounted", $"total" * (lit(1.0) - $"discount"/100.0)).
  select("id", "discount", "orders", "total", "discounted").
  show(false)
// +---+--------+------------------------------+-----+----------+
// |id |discount|orders                        |total|discounted|
// +---+--------+------------------------------+-----+----------+
// |101|10      |[[1.2, 5], [1.0, 3], [1.5, 4]]|15.0 |13.5      |
// |102|15      |[[1.5, 6], [0.8, 5], [1.0, 7]]|20.0 |17.0      |
// +---+--------+------------------------------+-----+----------+
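aggregate is essentially a left fold over the array, so the totals above can be reproduced in plain Scala with foldLeft; Order is again a stand-in for the struct elements:

```scala
// Stand-in for the `orders` array elements.
case class Order(price: Double, qty: Int)

// Plain-Scala analogue of `aggregate(orders, 0d, (acc, x) -> acc + x.price * x.qty)`.
def total(orders: Seq[Order]): Double =
  orders.foldLeft(0.0)((acc, x) => acc + x.price * x.qty)

val total101 = total(Seq(Order(1.2, 5), Order(1.0, 3), Order(1.5, 4)))
val discounted101 = total101 * (1.0 - 10 / 100.0)
// total101 ≈ 15.0 and discounted101 ≈ 13.5, matching the id-101 row above
```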


From Akka Untyped To Typed Actors

This is part 1 of a 3-part blog series about how to migrate an Akka classic actor-based application to one with Akka typed actors. In this post, we’ll focus on how a typical classic actor can be replaced with a typed actor.

Akka’s move from its loosely-typed classic actor API to Akka Typed has been met with mixed feelings in the Akka community. On one hand, people are happy with the Actor toolkit being “morphed” into a suitably typed API. On the other hand, the general expectation is that it isn’t going to be a straightforward find-and-replace change.

Actors interact by means of non-blocking message passing. Each actor maintains a “mailbox” and messages addressed to it get processed in the order of receipt. Akka classic actors are loosely typed mainly because their message processing logic relies on the implementation of the abstract method receive, which has the following signature:

// Akka Actor.Receive
type Receive = PartialFunction[Any, Unit]
abstract def receive: Receive

It takes an input of type Any (i.e. allowing messages of any type) and isn’t obligated to return anything. In addition, as a partial function it allows by-design non-exhaustive matching against message types.
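This looseness is easy to observe outside Akka with a bare PartialFunction[Any, Unit], which is all Receive is: any input type compiles, and isDefinedAt exposes the non-exhaustive matching:

```scala
// A receive-style handler: input type Any, result Unit, partial matching.
val receive: PartialFunction[Any, Unit] = {
  case s: String => println(s"got a String: $s")
  case n: Int    => println(s"got an Int: $n")
}

receive.isDefinedAt("hello")  // true
receive.isDefinedAt(4.2)      // false -- such a message would simply go unhandled
```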

Actor “behaviors”

In Akka Typed, processing logic of the messages received by an actor is defined in methods that return Behavior[T] of a given message type T. A factory object Behaviors provides a number of predefined Behavior[T]s (e.g. Behaviors.same, Behaviors.stopped) and general methods (e.g. Behaviors.receiveMessage()) for user-defined message processing logic.

Akka’s official style guide proposes two different flavors of the typed API: functional vs object-oriented. I would highly recommend reading through examples and the pros and cons of the two styles presented there. In brief, the functional approach mutates an actor’s state by successively passing in the state as a parameter to the processing method, whereas the alternative embraces the object-oriented principles and relaxes the use of mutable class variables for state maintenance.

An Akka classic actor for blockchain mining

Let’s look at a blockchain mining Scala snippet used in an Actor-based cryptocurrency system as an example. The original code written in Akka classic actors is like this:

// Akka classic actor `Miner`
object Miner {
  def props(accountKey: String, timeoutPoW: Long): Props = Props(classOf[Miner], accountKey, timeoutPoW)

  sealed trait Mining
  case class Mine(blockPrev: Block, trans: Transactions) extends Mining
  case object DoneMining extends Mining
}

class Miner(accountKey: String, timeoutPoW: Long) extends Actor with ActorLogging {
  import Miner._

  implicit val ec: ExecutionContext = context.dispatcher
  implicit val timeout = timeoutPoW.millis

  override def receive: Receive = idle

  def idle: Receive = {
    case Mine(blockPrev, trans) =>
      context.become(busy)

      val recipient = sender()
      val newBlock = generateNewBlock(blockPrev, trans)

      generatePoW(newBlock).map{ newNonce =>
          recipient ! newBlock.copy(nonce = newNonce)
        }.
        recover{ case e: Exception =>
          recipient ! Status.Failure(e)
        }

    case _ =>
      // Do nothing
  }

  def busy: Receive = {
    case Mine(b, t) =>
      log.error(s"[Mining] Miner.Mine($b, $t) received but $this is busy!")
      sender() ! Status.Failure(new Blockchainer.BusyException(s"$this is busy!"))

    case DoneMining =>
      context.become(idle)
      log.info(s"[Mining] Miner.DoneMining received.")
  }

  private def generateNewBlock(blockPrev: Block, trans: Transactions): LinkedBlock = ???

  private def generatePoW(block: Block)(implicit ec: ExecutionContext, timeout: FiniteDuration): Future[Long] = ???
}

It should be noted that Miner is an actor that serves to return a mined block (of type Blockchainer.Req) upon receiving an ask query from another actor (Blockchainer). Within actor Miner, method generatePoW() asynchronously produces a Future[Long] whose value then gets embedded in a block sent back to the querying actor.
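The map-then-recover wiring around generatePoW() can be exercised in isolation; the Block case class and the hard-coded nonce below are simplified stand-ins, not the application's real types:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Simplified stand-ins for the application's `Block` and proof-of-work routine.
case class Block(data: String, nonce: Long)
def generatePoW(block: Block): Future[Long] = Future(42L)  // dummy nonce

val newBlock = Block("trans", nonce = 0L)
val mined: Future[Block] =
  generatePoW(newBlock).
    map(newNonce => newBlock.copy(nonce = newNonce)).  // embed the mined nonce
    recover { case _: Exception => newBlock }          // simplified fallback on failure

Await.result(mined, 1.second)  // Block(trans,42)
```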

ADT for actor message type

Before composing the typed actor version of Miner, let’s first look at what message type we would allow it to receive, since the actor reference of a typed actor is strictly typed.

// Akka typed actor `Miner`
object Miner {
  sealed trait Mining
  case class Mine(blockPrev: Block, trans: Transactions, replyTo: ActorRef[Blockchainer.Req]) extends Mining
  case object DoneMining extends Mining
  // ...
}

Similar to how message types are commonly set up in an Akka classic actor, they’re also generally defined as an ADT (algebraic data type) in Akka Typed. Since the actor reference is now typed, the ADT plays an additional role – its base trait becomes the type parameter of the actor. By defining the actor reference of Miner as ActorRef[Miner.Mining], the actor will only be able to take messages of type Miner.Mining or its subtypes.
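The compile-time side of this can be seen with the ADT alone, no actors involved; a handler typed to the sealed base trait accepts only those messages, and the compiler checks match exhaustiveness (the fields below are simplified for illustration):

```scala
// The message ADT by itself: a sealed base trait plus its subtypes.
sealed trait Mining
case class Mine(blockHash: String) extends Mining  // simplified fields
case object DoneMining extends Mining

// Typed to `Mining`, so only `Mining` subtypes can be passed in; the
// compiler warns if any subtype of the sealed trait goes unmatched.
def describe(msg: Mining): String = msg match {
  case Mine(hash) => s"mine on top of $hash"
  case DoneMining => "mining finished"
}

describe(Mine("0xabc"))  // "mine on top of 0xabc"
```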

Because of the type constraint for a given Akka typed actor, the good old non-type binding sender() is no longer supported. To ensure an actor is able to reply to the sender with proper message type, it’s common to see messages sent across typed actors explicitly carrying the sender’s reference. Case class Mine in the Akka typed actor has a replyTo of type ActorRef[Blockchainer.Req] because message Mine is sent via an ask query by actor Blockchainer which expects suitable message type (in this case, Blockchainer.Req) to be returned by actor Miner.

Functional or object-oriented style?

For our Miner actor, the functional approach would be to define all behaviors as methods (e.g. via Behaviors.receiveMessage()) within the object Miner alone. The standard object-oriented alternative would be to add a companion Miner class that extends AbstractBehavior. For this blockchain mining application, I’m going to pick the object-oriented approach, partly out of personal preference for the more structured companion class-object model.

Extending AbstractBehavior would require implementation of abstract method onMessage() which has the following signature:

// AbstractBehavior.onMessage()
abstract def onMessage(msg: T): Behavior[T]

The typed Miner actor would look like this:

// Akka typed actor `Miner`
object Miner {
  // ...
  def apply(accountKey: String, timeoutPoW: Long): Behavior[Mining] =
    Behaviors.setup(context =>
      new Miner(context, accountKey, timeoutPoW)
    )
}

class Miner private(context: ActorContext[Miner.Mining], accountKey: String, timeoutPoW: Long)
    extends AbstractBehavior[Miner.Mining](context) {
  import Miner._
  override def onMessage(msg: Mining): Behavior[Mining] = msg match {
    case Mine(...) => ???
    case DoneMining => ???
  }
  // ...
}

Mimicking context.become in Akka Typed

The “hotswapping” feature within the message loop of an Akka classic actor, context.become, is useful for switching state upon receipt of designated messages. Surprisingly, I haven’t been able to find any concrete examples of how the good old context.become should be done in Akka Typed. Mimicking the feature using the functional approach seems pretty straightforward, but since I’m taking the object-oriented approach, it isn’t immediately clear to me how onMessage() fits into the “behavioral” switching scheme.

Method onMessage(msg: T) takes a received message and processes it in accordance with user-defined logic. The problem lies in the existence of the msg: T argument in the method. As soon as the actor is up, the initial message(s) received will be passed in, and in the case of a context.become switch to another Behavior method, the received message would need to be delegated to the relayed method. However, Behavior methods are designed to take a user-defined T => Behavior[T] function (or partial function), thus the initial msg: T must be handled from within onMessage(). This results in duplicated message processing logic among the Behavior methods.

Rather than taking the standard object-oriented approach, our Akka Typed Miner actor is defined with a companion class but without extending AbstractBehavior, thus leaving onMessage() out of the picture. A default message handler, messageLoop(), within class Miner is called upon instantiation by the companion object’s apply() to kick off a “behavior switching” loop. Behavior method idle() gets called, executes its business logic before conditionally relaying to another Behavior method busy() which, in turn, does its work and conditionally relays back to idle().

object Miner {
  // ...
  def apply(accountKey: String, timeoutPoW: Long): Behavior[Mining] =
    Behaviors.setup(context =>
      new Miner(context, accountKey, timeoutPoW).messageLoop()
      // ^^^ Instantiate class `Miner` and run `messageLoop()`
    )
}

class Miner private(context: ActorContext[Miner.Mining], accountKey: String, timeoutPoW: Long) {
  import Miner._
  private def messageLoop(): Behavior[Mining] = idle()  // <<< Switch to `idle() behavior`
  private def idle(): Behavior[Mining] = ???  // <<< Conditionally relay to `busy()`
  private def busy(): Behavior[Mining] = ???  // <<< Conditionally relay back to `idle()`
  // ...    
}
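The relay scheme can be modeled without Akka by viewing a behavior as a message handler that returns the next behavior, which is essentially what Behavior[T] is; a toy sketch with the Mining messages reduced to case objects:

```scala
sealed trait Mining
case object Mine       extends Mining
case object DoneMining extends Mining

// A behavior is a named handler returning the next behavior,
// mirroring how `idle()` and `busy()` relay to each other above.
case class MinerBehavior(name: String, onMessage: Mining => MinerBehavior)

def idle: MinerBehavior = MinerBehavior("idle", {
  case Mine       => busy  // start mining, switch to `busy`
  case DoneMining => idle  // nothing in flight, stay `idle`
})
def busy: MinerBehavior = MinerBehavior("busy", {
  case Mine       => busy  // reject while mining, stay `busy`
  case DoneMining => idle  // mining finished, switch back
})

// Fold a mailbox of messages through the switching loop.
val finalB = List(Mine, Mine, DoneMining).foldLeft(idle)((b, m) => b.onMessage(m))
// finalB.name == "idle"  (Mine -> busy, Mine -> busy, DoneMining -> idle)
```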

The blockchain mining actor in Akka Typed

Putting everything together, here’s what the blockchain mining actor in Akka Typed is like:

// Akka typed actor `Miner`
object Miner {
  sealed trait Mining
  case class Mine(blockPrev: Block, trans: Transactions, replyTo: ActorRef[Blockchainer.Req]) extends Mining
  case object DoneMining extends Mining

  def apply(accountKey: String, timeoutPoW: Long): Behavior[Mining] =
    Behaviors.setup(context =>
      new Miner(context, accountKey, timeoutPoW).messageLoop()
    )
}

class Miner private(context: ActorContext[Miner.Mining], accountKey: String, timeoutPoW: Long) {
  import Miner._

  implicit val ec: ExecutionContext = context.executionContext
  implicit val timeout = timeoutPoW.millis

  private def messageLoop(): Behavior[Mining] = idle()  // <-- Switch to `idle() behavior`

  private def idle(): Behavior[Mining] = Behaviors.receiveMessage{
    case Mine(blockPrev, trans, replyTo) =>
      val newBlock = generateNewBlock(blockPrev, trans)
      generatePoW(newBlock).map(newNonce =>
          replyTo ! Blockchainer.MiningResult(newBlock.copy(nonce = newNonce))
        ).
        recover{ case e: Exception => Blockchainer.OtherException(s"$e") }
      busy()  // <-- Switch to `busy() behavior`

    case DoneMining =>
      Behaviors.same
  }

  private def busy(): Behavior[Mining] = Behaviors.receiveMessage{
    case Mine(blockPrev, trans, replyTo) =>
      context.log.error(s"[Mining] Miner.Mine($blockPrev, $trans) received but $this is busy!")
      replyTo ! Blockchainer.BusyException(s"$this is busy!")
      Behaviors.same

    case DoneMining =>
      context.log.info(s"[Mining] Miner.DoneMining received.")
      idle()  // <-- Switch back to `idle() behavior`
  }

  private def generateNewBlock(blockPrev: Block, trans: Transactions): LinkedBlock = ???

  private def generatePoW(block: Block)(implicit ec: ExecutionContext, timeout: FiniteDuration): Future[Long] = ???
}


Akka Typed: Spawn, Tell, Ask

This is the 2nd post of the 3-part blog series about migrating an Akka classic actor-based application to one with Akka typed actors. In the 1st post, we looked at how to convert a classic actor into a typed actor.

Obviously, the goal of this mini blog series isn’t to cover all of the classic-to-typed-actors migration how-to’s, given the vast feature set Akka provides. The application being migrated is a blockchain application that mimics mining of a decentralized cryptocurrency. We’re going to cover just the key actor features used by the application, namely:

  • Starting an actor
  • Tell
  • Ask
  • Scheduler
  • Distributed PubSub

In this blog post, we’ll go over the first three bullet items.

Starting an Akka actor

In the Akka classic API, context method actorOf() starts an actor with its “properties” provided by the configuration factory Props. For example, actor Miner is started from within its parent actor as follows:

// Starting classic actor `Miner`
val miner: ActorRef = context.actorOf(Miner.props(accountKey, timeoutPoW), "miner")

It’s common to provide the actor class properties for Props by means of an actor’s companion object method, especially when the actor class takes parameters.

// Classic actor `Miner` class
object Miner {
  def props(accountKey: String, timeoutPoW: Long): Props = Props(classOf[Miner], accountKey, timeoutPoW)
  // ...
}

class Miner(accountKey: String, timeoutPoW: Long) extends Actor with ActorLogging {
  // ...
}

Starting an actor in Akka Typed is performed using actor context method spawn(). The typed version of actor Miner would be started from within its parent actor like below:

// Starting typed actor `Miner`
val miner: ActorRef[Miner.Mining] = context.spawn(Miner(accountKey, timeoutPoW), "miner")

An actor’s underlying class properties can now be defined using Behavior methods, hence the Props configuration factory is no longer needed. Below is how the typed Miner can be defined using method Behaviors.setup() from within the actor’s companion object:

// Typed actor `Miner` class
object Miner {
  // ...
  def apply(accountKey: String, timeoutPoW: Long): Behavior[Mining] =
    Behaviors.setup(context =>
      new Miner(context, accountKey, timeoutPoW).messageLoop()
    )
}

class Miner private(context: ActorContext[Miner.Mining], accountKey: String, timeoutPoW: Long) {
  // ...
}

Starting top-level actors

What about in the main program, before any actors have been created? In the Akka classic API, one can simply invoke actorOf() on the ActorSystem to start the main actors. Below is how the main program of a blockchain mining application spawns the top-level actors for mining simulations.

object Main {
  def main(args: Array[String]): Unit = {
    // Parse `args` and load configurations for the cluster and main actors
    // ...

    implicit val system = ActorSystem("blockchain", conf)
    val blockchainer = system.actorOf(
        Blockchainer.props(minerAccountKey, timeoutMining, timeoutValidation), "blockchainer"
      )
    val simulator = system.actorOf(
        Simulator.props(blockchainer, transFeedInterval, miningAvgInterval), "simulator"
      )

    // ...
  }
}

Perhaps for consistency reasons, Akka Typed requires actors to always be started from within an actor context using method spawn(), so an explicit ActorContext is needed even for the top-level main actors. The following snippet shows how the typed version of the blockchain application delegates to the Starter actor for starting the main actors after the main program has loaded actor properties from program arguments and the configuration file. The top-level user-defined actor Starter is regarded as the “user guardian”.

object Main {
  object Starter {
    def apply(accountKey: String, timeoutMining: Long, timeoutValidation: Long,
              transFeedInterval: Long, miningAvgInterval: Long, test: Boolean): Behavior[NotUsed] =
      Behaviors.setup { context =>
        Cluster(context.system)

        val blockchainer = context.spawn(Blockchainer(accountKey, timeoutMining, timeoutValidation), "blockchainer")
        val simulator = context.spawn(Simulator(blockchainer, transFeedInterval, miningAvgInterval), "simulator")

        if (test)
          simulator ! Simulator.QuickTest
        else
          simulator ! Simulator.MiningLoop

        Behaviors.receiveSignal { case (_, Terminated(_)) => Behaviors.stopped }
      }
  }

  def main(args: Array[String]): Unit = {
    // Parse `args` and load configurations for the cluster and main actors
    // ...

    implicit val system = ActorSystem(
      Starter(minerAccountKey, timeoutMining, timeoutValidation, transFeedInterval, miningAvgInterval, test),
      "blockchain",
      conf
    )
  }
}

The fire-and-forget “tell”

The most common communication means among actors is via method tell in a fire-and-forget fashion, which in essence implies that messages are sent with an at-most-once guarantee.

Akka classic tell and the symbolic ! variant have the following method signatures:

// Akka classic methods `tell` and `!`
final def tell(msg: Any, sender: ActorRef): Unit
abstract def !(message: Any)(implicit sender: ActorRef = Actor.noSender): Unit

In Akka Typed, methods tell and ! have signatures as follows:

// Akka Typed methods `tell` and `!`
abstract def tell(msg: T): Unit
def !(msg: T): Unit

Though messages are now strictly typed in the new API, tell expressions in Akka classic and Akka Typed both return Unit and essentially look and feel the same.
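A minimal sketch of what the type checking buys (assuming, as in the earlier snippets, that Miner.DoneMining is part of the Miner.Mining message protocol):

```scala
// Typed refs reject ill-typed messages at compile time
val miner: ActorRef[Miner.Mining] = context.spawn(Miner(accountKey, timeoutPoW), "miner")
miner ! Miner.DoneMining   // compiles: DoneMining belongs to the Mining protocol
// miner ! "bogus"         // would NOT compile in Akka Typed; classic `!` accepts Any
```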

The request-response query “ask”

The other commonly used communication means among actors is the request-response query via method ask.

Below are method signatures of Akka classic ask and ?:

// Akka classic methods `ask` and `?`
def ask(actorSelection: ActorSelection, message: Any)(implicit timeout: Timeout): Future[Any]
def ?(message: Any)(implicit timeout: Timeout, sender: ActorRef = Actor.noSender): Future[Any]

Here’s how the Blockchainer actor uses the classic ask to request a newly mined block from actor Miner and handle the Future query result:

(miner ? Miner.Mine(blockPrev, trans))(tmoMining).mapTo[Block] onComplete{
  case Success(block) =>
    mediator ! Publish("new-block", UpdateBlockchain(block))
    miner ! Miner.DoneMining
  case Failure(e) =>
    log.error(s"[Req.Mining] ${this}: ERROR: $e")
    e match {
      case _: BusyException => self ! AddTransactions(trans, append = false)
      case _ => miner ! Miner.DoneMining
    }
}

Note that the tmoMining timeout value is explicitly passed in as a method argument, and mapTo[Block] is necessary for mapping the returned Future[Any] to the proper type.

In addition, there are also a few general-purpose AskSupport methods allowing a query between two actors both specified as parameters. Here’s the method signature of one of the ask variants:

// Akka classic sender-explicit `ask`
def ask(actorRef: ActorRef, message: Any, sender: ActorRef)(implicit timeout: Timeout): Future[Any]

As for Akka Typed, a context method ask is provided with the following signature:

// Akka Typed context method `ask`
abstract def ask[Req, Res](target: RecipientRef[Req], createRequest: (ActorRef[Res]) => Req)(
    mapResponse: (Try[Res]) => T)(implicit responseTimeout: Timeout, classTag: ClassTag[Res]): Unit

Below is how the typed version of actor Blockchainer would use context ask to query typed actor Miner:

implicit val tmoMining: Timeout = Timeout(timeoutMining.millis)
context.ask(miner, ref => Miner.Mine(blockPrev, trans, ref)) {
  case Success(r) =>
    r match {
      case MiningResult(block) =>
        topicBlock ! Topic.Publish(UpdateBlockchain(block))
        miner ! Miner.DoneMining
        MiningResult(block)
      case _ =>
        OtherException(s"Unknown mining result $r")
    }
  case Failure(e) =>
    context.log.error(s"[Req.Mining] ${this}: ERROR: $e")
    e match {
      case _: BusyException =>
        context.self ! AddTransactions(trans, append = false)
        BusyException(e.getMessage)
      case _ =>
        miner ! Miner.DoneMining
        OtherException(e.getMessage)
    }
}

Note that context ask doesn’t return the query result as a Future to be handled by a callback such as onComplete. Rather, it expects one to handle the query response by providing a Try[Res] => T function.

Akka Typed also provides AskPattern methods that return Futures, with the following method signatures:

// Akka Typed Future-returning methods `ask` and `?`
def ask[Res](replyTo: (ActorRef[Res]) => Req)(implicit timeout: Timeout, scheduler: Scheduler): Future[Res]
def ?[Res](replyTo: (ActorRef[Res]) => Req)(implicit timeout: Timeout, scheduler: Scheduler): Future[Res]

That’s all for this post. In the next blog, we’ll wrap up this mini blog series with the remaining topics (i.e. scheduler and distributed pubsub).


Akka Typed: Scheduler, PubSub

This is the final post of the 3-part blog series that centers around migrating an actor-based blockchain application from Akka classic to Akka Typed. In the previous post, we looked at the difference between the two APIs in starting actors and in tell-ing and ask-ing. This time, we’re going to cover the following topics:

  • Scheduler
  • Distributed PubSub

Scheduler

Akka classic’s Scheduler feature provides task-scheduling methods for recurring schedules (e.g. scheduleAtFixedRate(), which replaces the deprecated schedule()) and one-time schedules (scheduleOnce()), each with a few signature variants. For example, scheduleOnce() has several method signatures for different use cases:
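For reference, the main scheduleOnce() variants in the classic Scheduler look like this (abridged; consult the Akka classic API docs for the full list):

```scala
// Akka classic `Scheduler.scheduleOnce` variants (abridged)
final def scheduleOnce(delay: FiniteDuration, receiver: ActorRef, message: Any)(
    implicit executor: ExecutionContext, sender: ActorRef = Actor.noSender): Cancellable
final def scheduleOnce(delay: FiniteDuration)(f: => Unit)(
    implicit executor: ExecutionContext): Cancellable
def scheduleOnce(delay: FiniteDuration, runnable: Runnable)(
    implicit executor: ExecutionContext): Cancellable
```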

The blockchain application uses schedulers in a number of places. For instance, to ensure that the potentially time-consuming proof generating process won’t take more than a set duration, the generatePoW() method within actor Miner uses scheduleOnce() to try to complete a Promise with a timeout exception at a scheduled time:
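The actual implementation lives in the GitHub repo; a simplified sketch of the idea (the method and field names other than scheduleOnce() are assumptions) might look like:

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Future, Promise}
import scala.concurrent.duration.FiniteDuration

// Sketch: race the PoW computation against a scheduled timeout (names assumed)
def generatePoW(block: Block, timeout: FiniteDuration): Future[Long] = {
  import context.dispatcher
  val promise = Promise[Long]()
  // One-time schedule: fail the promise if the proof isn't found in time
  context.system.scheduler.scheduleOnce(timeout) {
    promise.tryFailure(new TimeoutException(s"PoW timed out after $timeout"))
  }
  // Run the difficulty-dependent proof search off the actor's dispatcher
  Future(ProofOfWork.generateProof(block.hash, block.difficulty))
    .foreach(promise.trySuccess)
  promise.future
}
```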

Similar to its counterpart in the classic API, Akka Typed Scheduler also provides task-scheduling methods for recurring and one-time schedules. Here’s the method signature of the typed version of scheduleOnce():
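Its signature takes a Runnable rather than a by-name block:

```scala
// Akka Typed `Scheduler` method `scheduleOnce`
def scheduleOnce(delay: FiniteDuration, runnable: Runnable)(
    implicit executor: ExecutionContext): Cancellable
```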

The variant that takes a f: => Unit by-name parameter is no longer available in Akka Typed. However, since Runnable only has a single abstract method (SAM), run(), the following expressions are essentially the same:
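Since Scala 2.12, a lambda can stand in for any single-abstract-method interface, so the two forms below are equivalent (mineNextBlock() here is just a stand-in for whatever task is scheduled):

```scala
// A Runnable written out explicitly ...
scheduler.scheduleOnce(delay, new Runnable { def run(): Unit = mineNextBlock() })
// ... and the same thing as a SAM-converted lambda
scheduler.scheduleOnce(delay, () => mineNextBlock())
```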

For example, the typed version of the proof-of-work generating method generatePoW() could be coded as follows:
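A sketch of the typed counterpart, again with assumed names, where the timeout task is now passed as a lambda-turned-Runnable:

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.concurrent.duration.FiniteDuration

// Sketch of the typed `generatePoW()` (names other than scheduleOnce are assumed)
def generatePoW(block: Block, timeout: FiniteDuration): Future[Long] = {
  implicit val ec: ExecutionContext = context.executionContext
  val promise = Promise[Long]()
  // The by-name variant is gone, so a lambda serves as the Runnable
  context.system.scheduler.scheduleOnce(
    timeout,
    () => promise.tryFailure(new TimeoutException(s"PoW timed out after $timeout"))
  )
  Future(ProofOfWork.generateProof(block.hash, block.difficulty))
    .foreach(promise.trySuccess)
  promise.future
}
```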

Also seemingly missing is the other variant for sending a message to another actor. In fact, it has been moved into ActorContext as a separate scheduleOnce() method which can be invoked from within a typed actor’s context for scheduling a one-time delivery of message to another actor:
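Its signature on ActorContext:

```scala
// Akka Typed `ActorContext` method `scheduleOnce`
def scheduleOnce[U](delay: FiniteDuration, target: ActorRef[U], msg: U): Cancellable
```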

The above schedulers are all thread-safe. There is also TimerScheduler, which is not thread-safe and hence should only be used from within the actor that owns it. An advantage of using TimerScheduler, typically via Behaviors.withTimers(), is that it’s bound to the lifecycle of the containing actor and will be cancelled automatically when the actor is stopped.
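A minimal sketch (message and interval names are assumptions, not the repo’s actual code) of wiring a recurring timer to an actor’s lifecycle:

```scala
import scala.concurrent.duration.FiniteDuration
import akka.actor.typed.Behavior
import akka.actor.typed.scaladsl.Behaviors

object Ticker {
  sealed trait Command
  case object Tick extends Command

  def apply(interval: FiniteDuration): Behavior[Command] =
    Behaviors.withTimers { timers =>
      // The timer is owned by this actor and cancelled automatically on stop
      timers.startTimerAtFixedRate(Tick, interval)
      Behaviors.receiveMessage { case Tick =>
        // e.g. trigger a simulated mining request here
        Behaviors.same
      }
    }
}
```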

Distributed PubSub

A cryptocurrency typically maintains a decentralized ledger as distributed copies of a growing blockchain kept by individual nodes on the system. The blockchain application used for this actor migration exercise achieves that by running a Distributed Publish Subscribe service on an Akka cluster. Named topics (in this case “new-transactions” and “new-block”) can be created, and subscribers to a given topic receive the objects submitted to it by publishers.

In the Akka classic API, the mediator actor, DistributedPubSubMediator, which is supposed to be started on each of the allocated cluster nodes, is responsible for managing a registry of actor references and replicating the entries to peer actors among the nodes.

Below is how the mediator actor, started from within the Blockchainer actor of a cluster node, registers subscribers (in this case the Blockchainer actor itself) and relays published topical objects to be consumed by peer cluster nodes:
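In outline (reusing the topic names and the UpdateBlockchain message from the earlier snippets), the classic mediator usage looks like:

```scala
import akka.cluster.pubsub.DistributedPubSub
import akka.cluster.pubsub.DistributedPubSubMediator.{Publish, Subscribe}

// The mediator is an extension started per node
val mediator = DistributedPubSub(context.system).mediator
// Register this actor as a subscriber of the two named topics
mediator ! Subscribe("new-transactions", self)
mediator ! Subscribe("new-block", self)
// Publish a newly mined block for peer cluster nodes to consume
mediator ! Publish("new-block", UpdateBlockchain(block))
```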

The Akka Typed pubsub abandons the “universal” mediator actor in favor of “topic-specific” actors. The typed version of Distributed PubSub functionality initiated from within the Blockchainer actor of a cluster node is as follows:
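In outline (again with names borrowed from earlier snippets), each named topic gets its own spawned Topic actor:

```scala
import akka.actor.typed.pubsub.Topic

// One topic-specific actor per named topic, spawned from the actor context
val topicBlock = context.spawn(Topic[UpdateBlockchain]("new-block"), "topic-new-block")
// Subscribe this actor's (appropriately typed) ref to the topic
topicBlock ! Topic.Subscribe(context.self)
// Publishing goes through the topic actor rather than a universal mediator
topicBlock ! Topic.Publish(UpdateBlockchain(block))
```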

Final thoughts

This concludes the mini blog series that covers the basics and selected features of Akka Typed actors. The resulting blockchain application in Akka Typed will be published on GitHub along with an overview in a separate blog.

The term “behavior” is commonly used in the standard Actor Model when describing how actors operate and mutate their internal states in accordance with the business logic. As can be seen across the sample snippets in this blog series, it’s ubiquitous in the Akka Typed API, with all the Behavior-typed methods from the Behaviors factory object spanning the entire lifecycle of actors. Compared with putting all actor business logic inside the receive partial-function “blackbox” in Akka classic, it does make the use of the various methods more intuitive, especially for newcomers.

With some self-discipline in sticking to programming best practices, the loosely-typed Akka classic API has the advantage of expressing complex non-blocking message-processing functionality with minimal boilerplate code. The message “loop” within an actor, in the form of a partial function, along with the hot-swapping feature via context.become, provides a simple yet robust construct for processing messages. I’m certainly going to miss its simplicity and elegance. That being said, moving towards the typed Akka API is inevitable if one plans to use Akka actors for more than a one-time project. It’s the right direction for sticking to idiomatic functional programming.


Actor-based Blockchain In Akka Typed

As blockchain computing continues to steadily gain momentum across various industries, relevant platforms such as Ethereum, Hyperledger, etc. have emerged and prospered. Even though the term blockchain has evolved beyond a mere keyword for cryptocurrency, its core operational structure still adheres to how a cryptocurrency fundamentally maintains a decentralized ledger: distributed copies of a growing blockchain kept by individual nodes on the system, which agree on an eventual version of the blockchain via a consensus algorithm.

Blockchain application using Akka classic actors

In 2020, I developed an Actor-based blockchain application (source code at GitHub) in Scala using Akka Actor’s classic API. While it’s primarily for proof of concept, the application does utilize relevant cryptographic functions (e.g. public key cryptography standards), hash data structure (e.g. Merkle trees), along with a simplified proof-of-work consensus algorithm to simulate mining of a decentralized cryptocurrency on a scalable cluster.

Since the blockchain application consists of just a handful of actors mostly handling common use cases, migrating it to the typed Actor API serves as a great trial exercise for something new. While it was never expected to be a trivial find-and-replace task, it was also not a particularly difficult one. A recent mini-blog series highlights the migration how-to’s of some key actor/cluster features used by the blockchain application.

The Akka Typed blockchain application

For the impatient, source code for the Akka Typed blockchain application is at this GitHub link.

Written in Scala, the application uses the good old sbt as its build tool, with the library dependencies specified in “{project-root}/build.sbt”. Besides akka-actor-typed and akka-cluster-typed for the typed actor/cluster features, Bouncy Castle and Apache Commons Codec are included for processing public key files in PKCS#8 PEM format.

For proof-of-concept purposes, the blockchain application can be run on a single computer across multiple shell command terminals, using the default configurations specified in “{project-root}/src/main/resources/application.conf”. The configuration file consists of information related to the Akka cluster/remoting transport protocol, seed nodes, etc. The cluster setup can be reconfigured to run on an actual network of computer nodes in the cloud. Also included in the configuration file are a number of configuration parameters for mining of the blockchain, such as mining reward, time limit, proof-of-work difficulty level, etc.

The underlying data structures of the blockchain

Since all the changes to the blockchain application are only for migrating actors to Akka Typed, the underlying data structures for the blockchain, its inner structural dependencies as well as associated cryptographic functions remain unchanged.

To make this blog post self-contained, some content of the application overview overlaps the previous overview for the Akka classic application. Nonetheless, I’m including some extra diagrams for a little more clarity.

Below is a diagram showing the blockchain’s underlying data structures:

  • Account, TransactionItem, Transactions
  • MerkleTree
  • Block, RootBlock, LinkedBlock
  • ProofOfWork

The centerpiece of the blockchain data structures is the abstract class Block which is extended by RootBlock (i.e. the “genesis block”) and LinkedBlock. Each of the individual blocks is identified by the hash field and backward-linked to its predecessor via hashPrev.

Field difficulty carries the difficulty level pre-set in the application configuration. The nonce field is also initialized from configuration and will be updated with the proof value returned from the (difficulty-dependent) proof-of-work consensus algorithm.

Class Transactions represents a sequence of transaction items, along with the sender/receiver (of type Account) and a timestamp. Both the transaction sequence and its hashed value merkleRoot are kept in a Block object. The Account class is identified by field key, which is the Base64 public key of the PKCS keypair possessed by the account owner. As for object ProofOfWork, it’s a “static” class keeping the consensus-algorithm methods for proof-of-work.
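As a rough sketch only (the field lists are simplified assumptions; the real definitions are under akkablockchain/model in the repo), the relationships described above could be pictured as:

```scala
case class Account(key: String)                      // key: Base64 PKCS public key
case class TransactionItem(from: Account, to: Account, amount: Long)
case class Transactions(items: Seq[TransactionItem], sender: Account,
                        receiver: Account, timestamp: Long)

abstract class Block {
  def hash: String          // identifies this block
  def hashPrev: String      // backward link to the predecessor block
  def merkleRoot: String    // hashed value of the transaction sequence
  def difficulty: Int       // pre-set in the application configuration
  def nonce: Long           // updated with the returned proof value
}
```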

For a deeper dive of the various objects’ inner workings, please read the following blogs:

  1. Transaction Hash Tree in a Blockchain
  2. Blockchain Mining and Proof-of-Work

Source code for the data structures can be found under akkablockchain/model in the GitHub repo.

The typed actors that “run” the blockchain

As for the actors, aside from being revised from loosely typed to strictly typed, their respective functionalities within a given cluster node as well as among the peer actors on other cluster nodes remain unchanged.

The following diagram highlights the hierarchical flow logic of the various actors on a given cluster node:

Starter – On each cluster node, the main program of the blockchain application initializes by starting up the top-level Starter actor (a.k.a. the user guardian), which in turn spawns two actors: Blockchainer and Simulator.

Blockchainer – The Blockchainer actor on any given cluster node maintains a copy of the blockchain and the transaction queue for a miner identified by their cryptographic public key. It collects submitted transactions in the queue and updates the blockchain according to the proof-of-work consensus rules by means of the cluster-wide distributed pub/sub. The actor delegates mining work to its child actor Miner and validation of mined blocks to child actor BlockInspector.

Miner – The mining task of carrying out the computationally demanding proof-of-work is handled by the Miner actor. Using an asynchronous routine bound by a configurable timeout, actor Miner returns the proofs to the parent actor via Akka request-response ask queries.

BlockInspector – This other child actor of Blockchainer is responsible for validating the content of a newly mined block before it can be appended to the existing blockchain. The validation verifies the generated proof within the block as well as the intertwined hash values up the chain of historical blocks. The result is then returned to the parent actor via an Akka ask.

Simulator – Actor Simulator simulates mining requests and transaction submissions sent to the Blockchainer actor on the same node. It generates periodic mining requests by successively calling the Akka scheduler function scheduleOnce with random variations of a configurable time interval. Transaction submissions are delegated to actor TransactionFeeder.

TransactionFeeder – This child actor of Simulator periodically submits transactions to actor Blockchainer via an Akka scheduler. Transactions are created with random user accounts and transaction amounts. Accounts are represented by their cryptographic public keys. For demonstration purpose, a number of PKCS#8 PEM keypair files were created and kept under “{project-root}/src/main/resources/keys/” to save initial setup time.

Since the overall functional flow of this application remains the same as the old one, this previously published diagram is also worth noting:

Akka Blockchain - functional flow

Source code for the actors can be found under akkablockchain/actor in the GitHub repo.

Areas that could be feature-enhanced

This blockchain application is primarily for proof of concept, thus the underlying data structure and security features have been vastly simplified. For it to get a little closer to a real-world cryptocurrency, addition/enhancement of features in a few areas should be considered. The following bullet items are a recap of the “Feature enhancement” section from the old blog:

Data encryption: The transactions are stored in the blockchain unencrypted. Individual transaction items could be encrypted, each of which to be stored with the associated cryptographic signature, requiring miners to verify the signature while allowing only those who have the corresponding private key for the transaction items to see the content.

Self-regulation: A self-regulatory mechanism that adjusts the difficulty level of the Proof-of-Work in accordance with network load would help stabilize the digital currency. For example, in the event of a significant plunge, Bitcoin would impose a self-regulatory reduction in the PoW difficulty requirement to temporarily make mining more rewarding, which helped dampen the fall.

Currency supply: In a cryptocurrency like Bitcoin, issuance of the mining reward by the network is essentially how the digital coins are “minted”. To keep the inflation rate under control as the currency supply increases, the rate of coin minting must be proportionately regulated over time. Bitcoin has a periodic “halving” mechanism that reduces the mining reward by half for every 210,000 blocks added to the blockchain and will cease producing new coins once the total supply reaches 21 million coins.

Blockchain versioning: Versioning of the blockchain would make it possible for future feature enhancement, algorithmic changes or security fix by means of a fork, akin to Bitcoin’s soft/hard forks, without having to discard the old system.

User Interface: The existing application focuses mainly on how to operate a blockchain network, thus supplementing it with, say, a Web-based user interface (e.g. using the Akka HTTP/Play framework) for miners to participate in mining would certainly make it a more user-friendly system.

Sample console log: running on 3 cluster nodes

Running the application and examining the console log output reveals how multiple miners, each on a separate cluster node, “collaboratively” compete for growing a blockchain with a consensual algorithm. In particular, pay attention to:

  • individual miners’ attempts at building new blocks, which succeed or fail depending on whether the Proof-of-Work routine times out
  • miners’ requests for adding new blocks to their blockchain being rejected by their occupied mining actor
  • how the individual copies of the blockchain possessed by the miners “evolve” by accepting the first successfully expanded blockchain done by themselves or their peers
  • the number of trials in Proof-of-Work for a given block, displayed as the rightmost argument of BLK() (e.g. BLK(p98k, T(fab1, 3099/2), 2021-05-11 18:05:13, 3, 2771128); for more details about how the mining of a blockchain works, please see this blog post)
  • how Akka cluster nodes switch their “leader” upon detecting failure (due to termination by users, network crashes, etc.)

Below is sample output of running the blockchain application with default configuration on 3 cluster nodes.


A Rate-limiter In Akka Stream

Rate-limiting is a common measure for preventing the resource of a given computing service (e.g. an API service) from being swamped by excessive requests. There are various strategies for achieving rate-limiting, but fundamentally it’s about how to limit the frequency of requests from any sources within a set time window. While a rate-limiter can be implemented in many different ways, it’s, by nature, something well-positioned to be crafted as a stream operator.

Wouldn’t “throttle()” suffice?

Akka Stream’s versatile stream processing functions make it an appealing option for implementing rate-limiters. It provides stream operators like throttle() with a token bucket model for industry-standard rate-limiting. However, directly applying the function to the incoming request elements would mechanically throttle every request, thus “penalizing” requests from all sources when excessive requests were from, say, just a single source.

We need a slightly more sophisticated rate-limiting solution for the computing service to efficiently serve “behaving” callers while not being swamped by “misbehaving” ones.

Rate-limiting calls to an API service

Let’s say we have an API service that we would like to equip with rate-limiting. Incoming requests will be coming through as elements of an input stream. Each incoming request consists of a source identifier and the API-call parameter, represented as a simple case class instance with apiKey being the unique key/id for an API user and apiParam the submitted parameter for the API call:

case class Request[A](apiKey: String, apiParam: A)

A simplistic API call function that takes the apiKey, apiParam and returns a Future may look something like this:

def apiCall[A, B](key: String, param: A)(implicit ec: ExecutionContext): Future[B] = ???

For illustration purposes, we’ll trivialize it to return a String-type Future:

def apiCall[A](key: String, param: A)(implicit ec: ExecutionContext): Future[String] =
  Future{s"apiResult($key, $param)"}

Next, we define the following main attributes for the rate-limiter:

val timeWindow = 2.seconds
val maxReqs = 10       // Max overall requests within the timeWindow
val maxReqsPerKey = 3  // Max requests per apiKey within the timeWindow

Strategy #1: Discard excessive API calls from any sources

We’ll look into two different filtering strategies that rate-limit calls to our API service. One approach is to limit API calls within the predefined timeWindow from any given apiKey to not more than the maxReqsPerKey value. In other words, those excessive incoming requests with a given apiKey above the maxReqsPerKey limit will be discarded. We can capture such filtering logic in a Flow like below:

// Rate-limiting flow that discards excessive API calls from any sources
def keepToLimitPerKey[A](): Flow[Seq[Request[A]], Seq[Request[A]], akka.NotUsed] = Flow[Seq[Request[A]]].
  map{ g =>
    g.foldLeft((List.empty[Request[A]], Map.empty[String, Int])){ case ((acc, m), req) =>
      val count = m.getOrElse(req.apiKey, 0) + 1
      if (count <= maxReqsPerKey) (req :: acc, m + (req.apiKey -> count))
      else (acc, m + (req.apiKey -> count))
    }._1.toSeq.reverse
  }

The filtering Flow takes a sequence of requests and returns a filtered sequence. By iterating through the input sequence with foldLeft while keeping track of the request count per apiKey in a Map, it keeps only up to the first maxReqsPerKey requests for any given apiKey.
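The foldLeft pass at the heart of this Flow is plain Scala, so it can be sanity-checked on an ordinary Seq without any Akka machinery. A minimal sketch (the sample batch and the helper name keepToLimit are made up for illustration):

```scala
case class Request[A](apiKey: String, apiParam: A)

val maxReqsPerKey = 3

// The same foldLeft logic as in keepToLimitPerKey, applied to a plain Seq
def keepToLimit[A](g: Seq[Request[A]]): Seq[Request[A]] =
  g.foldLeft((List.empty[Request[A]], Map.empty[String, Int])){ case ((acc, m), req) =>
    val count = m.getOrElse(req.apiKey, 0) + 1
    if (count <= maxReqsPerKey) (req :: acc, m + (req.apiKey -> count))
    else (acc, m + (req.apiKey -> count))
  }._1.reverse

// "k-5" appears 4 times in this batch; only its first 3 requests survive
val batch = Seq(5, 1, 5, 5, 2, 5).zipWithIndex.map{ case (x, i) => Request(s"k-$x", i + 1) }
val kept = keepToLimit(batch)
// kept: k-5/1, k-1/2, k-5/3, k-5/4, k-2/5 — the 4th k-5 request (param 6) is gone
```

Note that requests from the well-behaved keys k-1 and k-2 pass through untouched, which is exactly the point of per-key (rather than blanket) throttling.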

Strategy #2: Drop all API calls from any “offending” sources

An alternative strategy is that for any given apiKey, all API calls with that key will be dropped if the count exceeds the maxReqsPerKey value within the timeWindow. Here’s the corresponding filtering Flow:

// Rate-limiting flow that drops all API calls from any offending sources
def dropAllReqsByKey[A](): Flow[Seq[Request[A]], Seq[Request[A]], akka.NotUsed] = Flow[Seq[Request[A]]].
  map{ g =>
    val offendingKeys = g.groupMapReduce(_.apiKey)(_ => 1)(_ + _).
      collect{ case (key, cnt) if cnt > maxReqsPerKey => key }.toSeq
    g.filterNot(req => offendingKeys.contains(req.apiKey))
  }

As shown in the self-explanatory code, this alternative filtering Flow simply identifies which apiKeys originate the count-violating requests per timeWindow and filters out all of their requests.
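This grouping logic, too, can be exercised outside Akka Stream on a plain Seq. A quick sketch (sample batch and helper name dropOffenders are illustrative; groupMapReduce requires Scala 2.13+):

```scala
case class Request[A](apiKey: String, apiParam: A)

val maxReqsPerKey = 3

// The same counting/filtering logic as in dropAllReqsByKey
def dropOffenders[A](g: Seq[Request[A]]): Seq[Request[A]] = {
  val offendingKeys = g.groupMapReduce(_.apiKey)(_ => 1)(_ + _).
    collect{ case (key, cnt) if cnt > maxReqsPerKey => key }.toSeq
  g.filterNot(req => offendingKeys.contains(req.apiKey))
}

// "k-5" exceeds the per-key limit, so ALL of its requests are dropped
val batch = Seq(5, 1, 5, 5, 2, 5).zipWithIndex.map{ case (x, i) => Request(s"k-$x", i + 1) }
val kept = dropOffenders(batch)
// Only the k-1 and k-2 requests (params 2 and 5) remain
```

Contrast this with strategy #1 on the same batch: there, the first three k-5 requests would still be served; here the offending key loses everything within the window.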

Grouping API requests in time windows using “groupedWithin()”

Now that we’re equipped with a couple of rate-limiting strategies, we’re going to come up with a stream operator that does the appropriate grouping of the API requests. To achieve that, we use the Akka Stream function groupedWithin(), which divides a stream into groups of up to a given number of elements received within a time window. It has the following method signature:

def groupedWithin(n: Int, d: FiniteDuration): Repr[Seq[Out]]

The function produces chunks of API requests that serve as properly-typed input to be ingested by one of the filtering Flows we’ve created. That seems to fit perfectly into what we need.

There is a caveat, though. The groupedWithin() operator emits when the given time interval (i.e. d, which corresponds to timeWindow in our use case) has elapsed since the previous emission, or when the specified number of elements (i.e. n, which corresponds to our maxReqs) has been buffered — whichever happens first. In essence, if there are more than n elements readily available upstream, the operator alone will not fulfill our at-most-n-elements requirement within the time window.

A work-around is to subsequently apply throttle() to each group of requests as a single batch to enforce the time-windowed rate-limiting requirement.

Test-running our API service rate-limiter

Let’s assemble a minuscule stream of requests to test-run our rate-limiter using the first filtering strategy. To make it easy to spot the dropped API requests, we assign the apiParam parameter of each request an integer value that reveals the request’s position in the input stream via zipWithIndex.

import akka.actor.ActorSystem
import akka.stream.scaladsl._
import akka.stream.{ActorMaterializer, ThrottleMode, OverflowStrategy}
import akka.NotUsed
import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.duration._

implicit val system = ActorSystem("system")
implicit val ec = system.dispatcher
implicit val materializer = ActorMaterializer()  // for Akka stream v2.5 or below

case class Request[A](apiKey: String, apiParam: A)

def apiCall[A](key: String, param: A)(implicit ec: ExecutionContext): Future[String] =
  Future{s"apiResult($key, $param)"}

val timeWindow = 2.seconds
val maxReqs = 10
val maxReqsPerKey = 3

val requests: Iterator[Request[Int]] = Vector(
    5, 1, 1, 4, 5, 5, 5, 6, 2, 2,  // Rogue keys: `5`
    1, 5, 5, 2, 3, 4, 6, 6, 4, 4,
    5, 4, 3, 3, 4, 4, 4, 1, 3, 3,  // Rogue keys: `3` & `4` 
    6, 1, 1, 4, 4, 1, 1, 5         // Rogue keys: `1`
  ).
  zipWithIndex.
  map{ case (x, i) => Request(s"k-$x", i + 1) }.
  iterator

Source.fromIterator(() => requests).
  groupedWithin(maxReqs, timeWindow).
  via(keepToLimitPerKey()).  // Rate-limiting strategy #1
  throttle(1, timeWindow, 1, ThrottleMode.Shaping).
  mapConcat(_.map(req => apiCall(req.apiKey, req.apiParam))).
  runForeach(println)

| Future(Success(apiResult(k-5, 1)))
| Future(Success(apiResult(k-1, 2)))
| Future(Success(apiResult(k-1, 3)))
| Future(Success(apiResult(k-4, 4)))
| Future(Success(apiResult(k-5, 5)))
| Future(Success(apiResult(k-5, 6)))   // Request(k-5, 7) dropped
| Future(Success(apiResult(k-6, 8)))
| Future(Success(apiResult(k-2, 9)))
| Future(Success(apiResult(k-2, 10)))
v <-- ~2 seconds
| Future(Success(apiResult(k-1, 11)))
| Future(Success(apiResult(k-5, 12)))
| Future(Success(apiResult(k-5, 13)))
| Future(Success(apiResult(k-2, 14)))
| Future(Success(apiResult(k-3, 15)))
| Future(Success(apiResult(k-4, 16)))
| Future(Success(apiResult(k-6, 17)))
| Future(<not completed>)
| Future(<not completed>)
| Future(Success(apiResult(k-4, 20)))
v <-- ~2 seconds
| Future(Success(apiResult(k-5, 21)))
| Future(Success(apiResult(k-4, 22)))
| Future(Success(apiResult(k-3, 23)))
| Future(Success(apiResult(k-3, 24)))
| Future(Success(apiResult(k-4, 25)))
| Future(Success(apiResult(k-4, 26)))  // Request(k-4, 27) dropped
| Future(Success(apiResult(k-1, 28)))
| Future(Success(apiResult(k-3, 29)))  // Request(k-3, 30) dropped
v <-- ~2 seconds
| Future(<not completed>)
| Future(Success(apiResult(k-1, 32)))
| Future(Success(apiResult(k-1, 33)))
| Future(Success(apiResult(k-4, 34)))
| Future(Success(apiResult(k-4, 35)))
| Future(Success(apiResult(k-1, 36)))  // Request(k-1, 37) dropped
| Future(Success(apiResult(k-5, 38)))
v <-- ~2 seconds

Note that mapConcat() is for flattening the stream of grouped API requests back to a stream of individual requests in their original order.
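In plain-collection terms (an illustrative analogy, not Akka code), this flattening step behaves like flatMap over the emitted groups, preserving element order:

```scala
// Grouped batches, shaped like what groupedWithin would emit downstream
val grouped = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6))

// mapConcat-style flattening: back to individual elements, order intact
val flattened = grouped.flatMap(identity)
// flattened == Seq(1, 2, 3, 4, 5, 6)
```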

Next, we test-run our rate-limiter using the alternative filtering strategy with the same input stream and timeWindow/maxReqs/maxReqsPerKey parameters:

val requests: Iterator[Request[Int]] = Vector(
    5, 1, 1, 4, 5, 5, 5, 6, 2, 2,  // Rogue keys: `5`
    1, 5, 5, 2, 3, 4, 6, 6, 4, 4,
    5, 4, 3, 3, 4, 4, 4, 1, 3, 3,  // Rogue keys: `3` & `4` 
    6, 1, 1, 4, 4, 1, 1, 5         // Rogue keys: `1`
  ).
  zipWithIndex.
  map{ case (x, i) => Request(s"k-$x", i + 1) }.
  iterator

Source.fromIterator(() => requests).
  groupedWithin(maxReqs, timeWindow).
  via(dropAllReqsByKey()).  // Rate-limiting strategy #2
  throttle(1, timeWindow, 1, ThrottleMode.Shaping).
  mapConcat(_.map(req => apiCall(req.apiKey, req.apiParam))).
  runForeach(println)

| Future(Success(apiResult(k-1, 2)))
| Future(<not completed>)
| Future(Success(apiResult(k-4, 4)))
| Future(Success(apiResult(k-6, 8)))
| Future(Success(apiResult(k-2, 9)))
| Future(Success(apiResult(k-2, 10)))  // All requests by k-5 dropped
v <-- ~2 seconds
| Future(<not completed>)
| Future(Success(apiResult(k-5, 12)))
| Future(Success(apiResult(k-5, 13)))
| Future(Success(apiResult(k-2, 14)))
| Future(Success(apiResult(k-3, 15)))
| Future(Success(apiResult(k-4, 16)))
| Future(Success(apiResult(k-6, 17)))
| Future(Success(apiResult(k-6, 18)))
| Future(Success(apiResult(k-4, 19)))
| Future(Success(apiResult(k-4, 20)))
v <-- ~2 seconds
| Future(Success(apiResult(k-5, 21)))
| Future(Success(apiResult(k-1, 28)))  // All requests by k-3 or k-4 dropped
v <-- ~2 seconds
| Future(Success(apiResult(k-6, 31)))
| Future(Success(apiResult(k-4, 34)))
| Future(Success(apiResult(k-4, 35)))
| Future(Success(apiResult(k-5, 38)))  // All requests by k-1 dropped
v <-- ~2 seconds

Wrapping the rate-limiter in a class

To generalize the rate-limiter, we can create a wrapper class that parameterizes apiCall and filteringStrategy along with the timeWindow, maxReqs, maxReqsPerKey parameters.

case class Request[A](apiKey: String, apiParam: A)

case class RateLimiter[A, B](apiCall: (String, A) => Future[B],
                             filteringStrategy: Int => Flow[Seq[Request[A]], Seq[Request[A]], NotUsed],
                             timeWindow: FiniteDuration,
                             maxReqs: Int,
                             maxReqsPerKey: Int)(implicit ec: ExecutionContext) {
  def flow(): Flow[Request[A], Future[B], NotUsed] =
    Flow[Request[A]].
      groupedWithin(maxReqs, timeWindow).
      via(filteringStrategy(maxReqsPerKey)).
      throttle(1, timeWindow, 1, ThrottleMode.Shaping).
      mapConcat(_.map(req => apiCall(req.apiKey, req.apiParam)))
}

object RateLimiter {
  // Rate-limiting flow that discards excessive API calls from any sources
  def keepToLimitPerKey[A](maxReqsPerKey: Int): Flow[Seq[Request[A]], Seq[Request[A]], akka.NotUsed] =
    Flow[Seq[Request[A]]].map{ g =>
      g.foldLeft((List.empty[Request[A]], Map.empty[String, Int])){ case ((acc, m), req) =>
        val count = m.getOrElse(req.apiKey, 0) + 1
        if (count <= maxReqsPerKey) (req :: acc, m + (req.apiKey -> count))
        else (acc, m + (req.apiKey -> count))
      }._1.toSeq.reverse
    }

  // Rate-limiting flow that drops all API calls from any offending sources
  def dropAllReqsByKey[A](maxReqsPerKey: Int): Flow[Seq[Request[A]], Seq[Request[A]], akka.NotUsed] =
    Flow[Seq[Request[A]]].map{ g =>
      val offendingKeys = g.groupMapReduce(_.apiKey)(_ => 1)(_ + _).
        collect{ case (key, cnt) if cnt > maxReqsPerKey => key }.toSeq
      g.filterNot(req => offendingKeys.contains(req.apiKey))
    }
}

Note that implementations of any available filtering strategies are now kept within the RateLimiter companion object.

A “biased” random-number function

Let’s also create a simple function for generating “biased” random integers for test-running the rate-limiter class.

def biasedRandNum(l: Int, u: Int, biasedNums: Set[Int], biasedFactor: Int = 1): Int = {
  def rand = java.util.concurrent.ThreadLocalRandom.current 
  Vector.
    iterate(rand.nextInt(l, u+1), biasedFactor)(_ => rand.nextInt(l, u+1)).
    dropWhile(!biasedNums.contains(_)).
    headOption match {
      case Some(n) => n
      case None => rand.nextInt(l, u+1)
    }
}

Method biasedRandNum() simply generates a random integer within a given range, skewed towards elements of the provided biasedNums set. The biasedFactor (e.g. 0, 1, 2, …) influences the skew level by forcing the random number generator to repeat “biased” trials, with 0 representing no bias. A larger biasedFactor value will increase the skew.

For example, biasedRandNum(0, 9, Set(1, 3, 5)) will generate a random integer between 0 and 9 (inclusive), skewing towards generating 1, 3 or 5 with the default biasedFactor = 1.
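As a quick sanity check of the skew, one can tally a batch of draws; the snippet below reuses the function above and reports the share of biased values (the batch size and factor here are just illustrative):

```scala
import java.util.concurrent.ThreadLocalRandom

def biasedRandNum(l: Int, u: Int, biasedNums: Set[Int], biasedFactor: Int = 1): Int = {
  def rand = ThreadLocalRandom.current
  Vector.
    iterate(rand.nextInt(l, u+1), biasedFactor)(_ => rand.nextInt(l, u+1)).
    dropWhile(!biasedNums.contains(_)).
    headOption match {
      case Some(n) => n
      case None    => rand.nextInt(l, u+1)
    }
}

// Tally 10,000 draws over 0..9, biased towards 1, 3 and 5
val draws = Vector.fill(10000)(biasedRandNum(0, 9, Set(1, 3, 5), biasedFactor = 3))
val biasedCount = draws.count(Set(1, 3, 5))
println(s"biased share: ${biasedCount.toDouble / draws.size}")
```

With biasedFactor = 3, roughly three quarters of the draws land on a biased value, versus the unbiased 30%.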

Test-running the rate-limiter class with random data

def apiCall[A](key: String, param: A)(implicit ec: ExecutionContext): Future[String] =
  Future{s"apiResult($key, $param)"}

val requests: Iterator[Request[Int]] = Vector.tabulate(1200)(_ => biasedRandNum(0, 9, Set(1, 3, 5), 2)).
  zipWithIndex.
  map{ case (x, i) => Request(s"k-$x", i + 1) }.
  iterator

Source.fromIterator(() => requests).
  via(RateLimiter(apiCall, RateLimiter.dropAllReqsByKey[Int], 2.seconds, 500, 20).flow()).
  runForeach(println)

In the above example, you’ll see in the output a batch of up to 500 elements printed every couple of seconds. The biasedFactor is set to 2, significantly skewing the random apiKey values towards the biased elements 1, 3 and 5. Since the filtering strategy dropAllReqsByKey is chosen, a likely observation is that all requests with apiKey k-1, k-3 or k-5 will be dropped by the rate-limiter.

I’ll leave it to the readers to experiment with the rate-limiter by changing the values of parameters in biasedRandNum() as well as constructor fields in class RateLimiter.


Streaming ETL With Alpakka Kafka

In a previous startup I cofounded, our core product was a geospatial application that provided algorithmic ratings of individual residential real estate properties for home buyers. Given that there were over 100 million residential properties nationwide, the collective volume of all the associated attributes necessary for the data engineering work was massive.

For the initial MVP (minimum viable product) releases in which we only needed to showcase our product features in a selected metropolitan area, we used PostgreSQL as the OLTP (online transaction processing) database. Leveraging Postgres’ table partitioning feature, we had an OLTP database capable of accommodating incremental geographical expansion into multiple cities and states.

Batch ETL

The need for a big data warehouse wasn’t imminent in the beginning, though we had to make sure a data processing platform for a highly scalable data warehouse, along with efficient ETL (extract/transform/load) functions, would be ready on short notice. The main objective was to keep the OLTP database at a minimal volume while less frequently used data got “archived” off to a big data warehouse for data analytics.

With limited engineering resources available in a small startup, I kicked off an R&D project on the side to build programmatic ETL processes that periodically funneled data from PostgreSQL to a big data warehouse in a batch manner. Cassandra was chosen as the data warehouse and was configured on an Amazon EC2 cluster. The project finished with a batch ETL solution that functionally worked as intended, although in the back of my mind a more “continuous” operational model would have been preferred.

Real-time Streaming ETL

Fast-forward to 2021, I recently took on a big data streaming project that involves ETL and building data pipelines on a distributed platform. Central to the project requirement is real-time (or more precisely, near real-time) processing of high-volume data. Another aspect of the requirement is that the streaming system has to accommodate custom data pipelines as composable components of the consumers, suggesting that a streaming ETL solution would be more suitable than a batch one. Lastly, stream consumption needs to guarantee at-least-once delivery.

Given all that, Apache Kafka promptly stood out as a top candidate to serve as the distributed streaming brokers. In particular, its capability of keeping durable data in a distributed fault-tolerant cluster allows it to serve different consumers at various instances of time and locales. Next, Akka Stream was added to the tech stack for its versatile stream-based application integration functionality as well as benefits of reactive streams.

Alpakka – a reactive stream API and DSL

Built on top of Akka Stream, Alpakka provides a comprehensive API and DSL (domain specific language) for reactive and stream-oriented programming to address the application integration needs for interoperating with a wide range of prominent systems across various computing domains. That, coupled with the underlying Akka Stream’s versatile streaming functions, makes Alpakka a powerful toolkit for what is needed.

In this blog post, we’ll assemble in Scala a producer and a consumer using the Alpakka API to perform streaming ETL from a PostgreSQL database through Kafka brokers into a Cassandra data warehouse. In a subsequent post, we’ll enhance and package up these snippets to address the requirement of at-least-once delivery in consumption and composability of data pipelines.

Streaming ETL with Alpakka Kafka, Slick, Cassandra, …

The following diagram shows the near-real time ETL functional flow of data streaming from various kinds of data sources (e.g. a PostgreSQL database or a CSV file) to data destinations (e.g. a Cassandra data warehouse or a custom data stream outlet).

Alpakka Kafka - Streaming ETL

The Apache Kafka brokers provide a distributed publish-subscribe platform for keeping in-flight data in durable immutable logs readily available for consumption. Meanwhile, the Akka Stream based Alpakka API that comes with a DSL allows programmatic integrations to compose data pipelines as sources, sinks and flows, in addition to enabling “reactivity” by equipping the streams with non-blocking backpressure.

It should be noted that the same stream can be processed using various data sources and destinations simultaneously. For instance, data with the same schema from both the CSV file and Postgres database could be published to the same topic and consumed by a consumer group designated for the Cassandra database and another consumer group for a different data storage.

Example: ETL of real estate property listing data

The platform will be for general-purpose ETL/pipelining. For illustration purpose in this blog post, we’re going to use it to perform streaming ETL of some simplified dataset of residential real estate property listings.

First, we create a simple class to represent a property listing.
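A minimal sketch of such a class might look like the following; aside from propertyId (the property_id primary key mentioned later), the field names are assumptions for illustration:

```scala
// Minimal property-listing model; all fields other than propertyId are
// illustrative assumptions, chosen to mirror a typical residential listing
case class PropertyListing(
  propertyId: Int,                       // maps to Postgres primary key property_id
  bathrooms: Option[Double] = None,
  bedrooms: Option[Int] = None,
  listPrice: Option[Double] = None,
  livingArea: Option[Int] = None,
  address: Option[String] = None,
  city: Option[String] = None,
  state: Option[String] = None,
  zip: Option[String] = None
)

// Example instance
val sample = PropertyListing(101, bedrooms = Some(3), listPrice = Some(450000.0))
println(sample)
```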

Using the good old sbt as the build tool, relevant library dependencies for Akka Stream, Alpakka Kafka, Postgres/Slick and Cassandra/DataStax are included in build.sbt.
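A representative sketch of those dependencies might look like the following; the artifact versions are illustrative assumptions, so check the current Alpakka release matrix for compatible versions:

```scala
// build.sbt -- versions are illustrative; align with the Alpakka release matrix
val akkaVersion = "2.6.14"

libraryDependencies ++= Seq(
  "com.typesafe.akka"  %% "akka-actor-typed"               % akkaVersion,
  "com.typesafe.akka"  %% "akka-stream"                    % akkaVersion,
  "com.typesafe.akka"  %% "akka-stream-kafka"              % "2.1.0",   // Alpakka Kafka
  "com.lightbend.akka" %% "akka-stream-alpakka-slick"      % "3.0.0",
  "com.lightbend.akka" %% "akka-stream-alpakka-cassandra"  % "3.0.0",
  "com.lightbend.akka" %% "akka-stream-alpakka-csv"        % "3.0.0",
  "org.postgresql"      % "postgresql"                     % "42.2.20"  // JDBC driver
)
```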

Next, we put configurations for Akka Actor, Alpakka Kafka, Slick and Cassandra in application.conf under src/main/resources/:
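A minimal single-host sketch of such a configuration might look like the following; the database name, credentials and datacenter name are placeholder assumptions:

```hocon
# application.conf -- single-host sample; values are illustrative
akka.kafka.producer {
  kafka-clients {
    bootstrap.servers = "127.0.0.1:9092"
  }
}

akka.kafka.consumer {
  kafka-clients {
    bootstrap.servers = "127.0.0.1:9092"
    auto.offset.reset = "earliest"
  }
}

# Referenced by SlickSession.forConfig("slick-postgres")
slick-postgres {
  profile = "slick.jdbc.PostgresProfile$"
  db {
    dataSourceClass = "slick.jdbc.DriverDataSource"
    properties {
      driver = "org.postgresql.Driver"
      url = "jdbc:postgresql://127.0.0.1:5432/propertydb"   # assumed db name
      user = "pipeliner"                                    # assumed credentials
      password = "changeme"
    }
  }
}

datastax-java-driver {
  basic.contact-points = ["127.0.0.1:9042"]
  basic.load-balancing-policy.local-datacenter = "datacenter1"
}
```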

Note that the sample configuration is for running the application with all Kafka, PostgreSQL and Cassandra on a single computer. The host IPs (i.e. 127.0.0.1) should be replaced with their corresponding host IPs/names in case they’re on separate hosts. For example, relevant configurations for Kafka brokers and Cassandra database spanning across multiple hosts might look something like bootstrap.servers = "10.1.0.1:9092,10.1.0.2:9092,10.1.0.3:9092" and contact-points = ["10.2.0.1:9042","10.2.0.2:9042"].

PostgresProducerPlain – an Alpakka Kafka producer

The PostgresProducerPlain snippet below creates a Kafka producer using Alpakka Slick which allows SQL queries to be coded in Slick’s functional programming style.

Method Slick.source[T]() takes a streaming query and returns a Source[T, NotUsed]. In this case, T is PropertyListing. Note that Slick.source() can also take a plain SQL statement wrapped within sql"..." as its argument, if wanted (in which case an implicit value of slick.jdbc.GetResult should be defined).

A subsequent map wraps each of the property listing objects in a ProducerRecord[K,V] with topic and key/value of type String/JSON, before publishing to the Kafka topic via Alpakka Kafka’s Producer.plainSink[K,V].
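Putting the pieces together, a minimal sketch of such a producer might look like the following; the Slick table query propertyListings, the topic name, and the value serialization (a toString stand-in rather than real JSON) are assumptions, not from the original source:

```scala
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.alpakka.slick.scaladsl.{Slick, SlickSession}
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

implicit val system: ActorSystem = ActorSystem("postgres-producer")
implicit val session: SlickSession = SlickSession.forConfig("slick-postgres")
import session.profile.api._

val producerSettings =
  ProducerSettings(system, new StringSerializer, new StringSerializer)

// `propertyListings` is the assumed Slick TableQuery mapped to table property_listing
Slick.source(propertyListings.result).
  map { listing =>
    // a real implementation would serialize the listing to JSON; toString is a stand-in
    new ProducerRecord[String, String](
      "property-listing", listing.propertyId.toString, listing.toString)
  }.
  runWith(Producer.plainSink(producerSettings))
```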

To run PostgresProducerPlain, simply navigate to the project root and execute the following command from within a command line terminal:

CassandraConsumerPlain – an Alpakka Kafka consumer

Using Alpakka Kafka, CassandraConsumerPlain shows how a basic Kafka consumer can be formulated as an Akka stream that consumes data from Kafka via Consumer.plainSource followed by a stream processing operator, Alpakka Cassandra’s CassandraFlow to stream the data into a Cassandra database.

A few notes:

  • Consumer.plainSource: As a first stab at building a consumer, we use Alpakka Kafka’s Consumer.plainSource[K,V] as the stream source. To ensure that the stream can be stopped in a controlled fashion, Consumer.DrainingControl is included when composing the stream graph. While straightforward to use, plainSource doesn’t offer programmatic tracking of the commit offset position and thus cannot guarantee at-least-once delivery. An enhanced version of the consumer will be constructed in a subsequent blog post.
  • Partition key: Cassandra mandates having a partition key as part of the primary key of every table for distribution across cluster nodes. In our property listing data, we take the Postgres primary key property_id modulo the number of partitions as the partition key. It could certainly be redefined to something else (e.g. locale or type of the property) in accordance with the specific business requirement.
  • CassandraSource: Method query() simply executes queries against a Cassandra database using CassandraSource, which takes a CQL query with syntax similar to standard SQL’s. It isn’t part of the consumer flow, but rather serves as a convenient tool for verifying stream consumption results.
  • CassandraFlow: Alpakka’s CassandraFlow.create[S]() is the main processing operator responsible for streaming data into the Cassandra database. It takes a CQL PreparedStatement and a “statement binder” that binds the incoming class variables to the corresponding Cassandra columns before executing the insert/update. In this case, S is ConsumerRecord[K,V].
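The modulo-based partition-key derivation described in the second note above can be sketched as follows; numPartitions is an assumed deployment-specific value:

```scala
// Hypothetical partition-key derivation: Postgres primary key property_id
// modulo the number of partitions (numPartitions is an assumed setting)
val numPartitions = 10

def partitionKey(propertyId: Int): Int = propertyId % numPartitions

// Sample keys for a few property ids
val sampleKeys = Seq(3, 10, 27).map(partitionKey)
println(sampleKeys)
```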

To run CassandraConsumerPlain, navigate to the project root and execute the following from within a command line terminal:

Table schema in PostgreSQL & Cassandra

Obviously, the streaming ETL application is supposed to run in the presence of one or more Kafka brokers, a PostgreSQL database and a Cassandra data warehouse. For proof of concept, getting all these systems up with basic configurations on a decent computer (Linux, Mac OS, etc.) is a trivial exercise. The ETL application is readily scalable: it would require only configuration changes when, say, the need arises to scale Kafka and Cassandra up to span clusters of nodes in the cloud.

Below is how the table schema of property_listing can be created in PostgreSQL via psql:
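A minimal sketch of such a DDL statement might look like this; apart from the property_id primary key mentioned above, the columns are assumptions for illustration:

```sql
-- Assumed minimal schema; actual columns depend on the data model
CREATE TABLE property_listing (
  property_id   integer PRIMARY KEY,
  bathrooms     numeric,
  bedrooms      integer,
  list_price    numeric,
  living_area   integer,
  address       text,
  city          text,
  state         text,
  zip           text
);
```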

To create keyspace propertydata and the corresponding table property_listing in Cassandra, one can launch cqlsh and execute the following CQL statements:
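A corresponding CQL sketch might look like the following; the column set and replication settings are illustrative assumptions, with the partition_key column following the modulo-based partitioning described earlier:

```sql
-- CQL: keyspace and table; replication settings are illustrative (single node)
CREATE KEYSPACE IF NOT EXISTS propertydata
  WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': 1 };

CREATE TABLE IF NOT EXISTS propertydata.property_listing (
  partition_key int,       -- property_id modulo the number of partitions
  property_id   int,
  bathrooms     decimal,
  bedrooms      int,
  list_price    decimal,
  living_area   int,
  address       text,
  city          text,
  state         text,
  zip           text,
  PRIMARY KEY (partition_key, property_id)
);
```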

What’s next?

So, we now have a basic streaming ETL system running Alpakka Kafka on top of a cluster of Kafka brokers to form the reactive stream “backbone” for near real-time ETL between data stores. With Alpakka Slick and Alpakka Cassandra, a relational database like PostgreSQL and a Cassandra data warehouse can be made part of the system as composable stream components.

As noted earlier, the existing Cassandra consumer does not guarantee at-least-once delivery, which is part of the requirement. In the next blog post, we’ll enhance the existing consumer to address the required delivery guarantee. We’ll also add a data processing pipeline to illustrate how to construct additional data pipelines as composable stream operators. All relevant source code along with some sample dataset will be published in a GitHub repo.


ETL & Pipelining With Alpakka Kafka

This extends the previous blog post about building an Akka Stream based real-time ETL system using Alpakka Kafka, although it could also be viewed as an independent post as I intend to include all the key elements of the system autonomously in this post.

Let’s first review the diagram shown in the previous post of what we’re aiming to build — a streaming ETL system empowered by reactive streams and Apache Kafka’s publish-subscribe machinery for durable stream data to be produced or consumed by various data processing/storage systems:

Alpakka Kafka - Streaming ETL

In this blog post, we’re going to:

  1. enhance the data warehouse consumer to programmatically track the commit offset positions,
  2. plug a data processing pipeline into an existing consumer as a composable stream processing operator, and,
  3. add to the streaming ETL system a mix of heterogeneous producers and consumers

Action item #1 would address the requirement of at-least-once delivery in stream consumption. #2 illustrates how to add to the streaming ETL system a custom data pipeline as a composable stream flow, and #3 showcases how data in various storage systems can participate in the real-time stream to operate (serially or in parallel) as composable sources, flows and sinks. All relevant source code is available in this GitHub repo.

Real-time streaming ETL/pipelining of property listing data

For usage demonstration, the application runs ETL/pipelining of data with a simplified real estate property listing data model. It should be noted that expanding it, or even changing it altogether to a different data model, should not affect how the core streaming ETL system operates.

Below are a couple of links related to library dependencies and configurations for the core application:

  • Library dependencies in build.sbt
  • Configurations for Akka, Kafka, PostgreSQL & Cassandra in application.conf

It’s also worth noting that the application can be scaled up with just configurative changes. For example, if the Kafka brokers and Cassandra database span across multiple hosts, relevant configurations like Kafka’s bootstrap.servers could be "10.1.0.1:9092,10.1.0.2:9092,10.1.0.3:9092" and contact-points for Cassandra might look like ["10.2.0.1:9042","10.2.0.2:9042"].

Next, let’s get ourselves familiarized with the property listing data definitions in the PostgreSQL and Cassandra, as well as the property listing classes that model the schemas.

A Kafka producer using Alpakka Csv

Alpakka comes with a simple API for CSV file parsing with method lineScanner() that takes parameters including the delimiter character and returns a Flow[ByteString, List[ByteString], NotUsed].
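As a sketch of the parsing portion, assuming a local file property_listing.csv (the file path is a placeholder, not from the original source):

```scala
import java.nio.file.Paths
import akka.actor.ActorSystem
import akka.stream.alpakka.csv.scaladsl.{CsvParsing, CsvToMap}
import akka.stream.scaladsl.FileIO

implicit val system: ActorSystem = ActorSystem("csv-producer")

// Parse the CSV file into a stream of Map[String, String] keyed by header row
FileIO.fromPath(Paths.get("property_listing.csv")).
  via(CsvParsing.lineScanner(delimiter = CsvParsing.Comma)).
  via(CsvToMap.toMapAsStrings()).
  runForeach(println)
```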

Below is the relevant code in CsvPlain.scala that highlights how the CSV file gets parsed and materialized into a stream of Map[String,String] via CsvParsing and CsvToMap, followed by transforming into a stream of PropertyListing objects.

Note that the drop(offset)/take(limit) code line, which can be useful for testing, is for taking a segmented range of the stream source and can be removed if preferred.

A subsequent map wraps each of the PropertyListing objects in a ProducerRecord[K,V] with the associated topic and key/value of type String/JSON before being streamed into Kafka via Alpakka Kafka’s Producer.plainSink().

A Kafka producer using Alpakka Slick

The PostgresPlain producer, which is pretty much identical to the one described in the previous blog post, creates a Kafka producer using Alpakka Slick which allows SQL queries into a PostgreSQL database to be coded in Slick’s functional programming style.

The partial code below shows how method Slick.source() takes a streaming query and returns a stream source of PropertyListing objects.

The high-level code logic in PostgresPlain is similar to that of the CsvPlain producer.

A Kafka consumer using Alpakka Cassandra

We created a Kafka consumer in the previous blog post using Alpakka Kafka’s Consumer.plainSource[K,V] for consuming data from a given Kafka topic into a Cassandra database.

The following partial code, from CassandraPlain (a slightly refactored version of that consumer), shows how data associated with a given Kafka topic can be consumed via Alpakka Kafka’s Consumer.plainSource().

Alpakka’s CassandraFlow.create() is the stream processing operator responsible for funneling data into the Cassandra database. Note that it takes a CQL PreparedStatement along with a “statement binder” that binds the incoming class variables to the corresponding Cassandra table columns before executing the CQL.

Enhancing the Kafka consumer for ‘at-least-once’ consumption

To enable at-least-once consumption by Cassandra, instead of Consumer.plainSource[K,V], we construct the stream graph via Alpakka Kafka Consumer.committableSource[K,V] which offers programmatic tracking of the commit offset positions. By keeping the commit offsets as an integral part of the streaming data, failed streams could be re-run from the offset positions.
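A minimal sketch of such a stream graph might look like the following; consumerSettings, the topic name, and the writeToCassandraFlow operator (a CassandraFlow.create(...) stage) are assumed to be defined elsewhere, and offsets are committed only after the Cassandra write:

```scala
import akka.kafka.{CommitterSettings, Subscriptions}
import akka.kafka.scaladsl.{Committer, Consumer}
import akka.kafka.scaladsl.Consumer.DrainingControl

val control: DrainingControl[_] =
  Consumer.committableSource(consumerSettings, Subscriptions.topics("property-listing")).
    via(writeToCassandraFlow).          // assumed CassandraFlow.create(...) operator
    map(_.committableOffset).           // commit only after the write has happened
    toMat(Committer.sink(CommitterSettings(system)))(DrainingControl.apply).
    run()
```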

The main stream composition code of the enhanced consumer, CassandraCommittable.scala, is shown below.

A couple of notes:

  1. In order to be able to programmatically keep track of the commit offset positions, each of the stream elements emitted from Consumer.committableSource[K,V] is wrapped in a CommittableMessage[K,V] object, consisting of the CommittableOffset value in addition to the Kafka ConsumerRecord[K,V].
  2. Committing the offset should be done after the stream data is processed for at-least-once consumption, whereas committing prior to processing the stream data would only achieve at-most-once delivery.

Adding a property-rating pipeline to the Alpakka Kafka consumer

Next, we add a data processing pipeline to the consumer to perform a number of ratings of the individual property listings in the stream before delivering the rated property listing data to the Cassandra database, as illustrated in the following diagram.

Alpakka Kafka - Streaming ETL w/ custom pipelines

Since the CassandraFlow.create() stream operator will be executed after the rating pipeline, the corresponding “statement binder” necessary for class-variable/table-column binding will now need to encapsulate also PropertyRating along with CommittableMessage[K,V], as shown in the partial code of CassandraCommittableWithRatings.scala below.

For demonstration purpose, we create a dummy pipeline for rating of individual real estate properties in areas such as affordability, neighborhood, each returning just a Future of random integers between 1 and 5 after a random time delay. The rating related fields along with the computation logic are wrapped in class PropertyRating as shown below.
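A minimal self-contained sketch of such a dummy rater might look like this; the names rateAffordability and rateNeighborhood are illustrative, not necessarily those in PropertyRating:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Random

// Dummy rater: returns a Future of a random 1-5 score after a random delay,
// simulating an asynchronous rating computation
def dummyRating(): Future[Int] = Future {
  Thread.sleep(Random.nextInt(100).toLong)  // simulated computation time
  1 + Random.nextInt(5)                     // random score in [1, 5]
}

// Illustrative per-area raters (names are assumptions)
def rateAffordability(propertyId: Int): Future[Int] = dummyRating()
def rateNeighborhood(propertyId: Int): Future[Int] = dummyRating()

val score = Await.result(rateAffordability(101), 2.seconds)
println(s"affordability score: $score")
```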

A Kafka consumer with a custom flow & stream destination

The application is also bundled with a consumer that runs the property-rating pipeline followed by a custom flow, showcasing how one can compose an arbitrary side-effecting operator with a custom stream destination.

Note that mapAsync is used to allow the stream transformation by the custom business logic to be carried out asynchronously.
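As a sketch, where source, businessLogic (of type A => Future[B]) and customSink are assumed placeholders:

```scala
// mapAsync runs the side-effecting function with bounded parallelism while
// preserving element order downstream
source.
  mapAsync(parallelism = 4)(businessLogic).
  runWith(customSink)  // assumed custom stream destination
```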

Running the streaming ETL/pipelining system

To run the application that comes with sample real estate property listing data on a computer, go to the GitHub repo and follow the README instructions to launch the producers and consumers on one or more command-line terminals.

Also included in the README are instructions on how to run a couple of preset queries to verify the data that gets ETL-ed into the Cassandra tables, via Alpakka Cassandra’s CassandraSource, which takes a CQL query as its argument.

Further enhancements

Depending on specific business requirement, the streaming ETL system can be further enhanced in a number of areas.

  1. This streaming ETL system offers at-least-once delivery only in stream consumption. If an end-to-end version is necessary, one could enhance the producers by using Producer.committableSink() or Producer.flexiFlow() instead of Producer.plainSink().
  2. For exactly-once delivery, which is a generally much more stringent requirement, one approach to achieve that would be to atomically persist the in-flight data with the corresponding commit offset positions using a reliable storage system.
  3. In case tracking of Kafka’s topic partition assignment is required, one can use Consumer.committablePartitionedSource[K,V] instead of Consumer.committableSource[K,V]. More details can be found in the tech doc.
  4. To gracefully restart a stream on failure with a configurable backoff, Akka Stream provides method RestartSource.onFailuresWithBackoff for that as illustrated in an example in this tech doc.
    为了在失败时通过可配置的退避优雅地重新启动流,Akka Stream 为此提供了方法 RestartSource.onFailuresWithBackoff,如本技术文档中的示例所示。
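Why committing offsets only after processing yields at-least-once delivery can be shown with a toy sketch. The following Python snippet (purely illustrative; not Alpakka Kafka code, and all names are hypothetical) simulates a consumer that crashes between processing a record and committing its offset — on restart the record is processed again, so data may be duplicated but is never lost.

```python
# Committing the offset only AFTER processing yields at-least-once
# delivery: a crash between processing and commit causes the record
# to be processed again on restart.

class TinyConsumer:
    def __init__(self, records):
        self.records = records
        self.committed = 0            # next offset to resume from

    def poll(self):
        return self.records[self.committed:]

def run_pipeline(consumer, sink, crash_before_commit_at=None):
    for offset, record in enumerate(consumer.poll(), start=consumer.committed):
        sink.append(record)                     # 1. process (the ETL step)
        if offset == crash_before_commit_at:
            return                              # simulated crash: no commit
        consumer.committed = offset + 1         # 2. commit the offset

consumer = TinyConsumer(["a", "b", "c"])
sink = []
run_pipeline(consumer, sink, crash_before_commit_at=1)  # crash after "b" processed
run_pipeline(consumer, sink)                            # restart and resume
print(sink)  # ['a', 'b', 'b', 'c'] -- "b" delivered twice, never lost
```

Committing before processing would flip the guarantee to at-most-once: the crashed record would be skipped on restart instead of duplicated.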


A Brief Overview Of Blockchains

It's early 2022, and blockchain has amassed more attention than ever. Having initially emerged in the form of a cryptocurrency, followed by the rise of additional ones operating on open-source development platforms, and further catalyzed by an NFT frenzy, the word "blockchain" has effectively evolved from a geek keyword into a household term.

While a majority of the public is still skeptical about the legitimacy of the blockchain phenomenon, apparently many bystanders are beginning to be bombarded by a mix of curiosity and a feeling of being left out, especially with the recent NFT mania.

It should be noted that by "blockchains", I'm generally referring to public permissionless blockchains for brevity.

Permissionless vs permissioned

Most of the blockchains commonly heard of, such as Bitcoin, Ethereum and Cardano, are permissionless blockchains in which anyone can participate anonymously. Many of these blockchains are open-source. On the other hand, permissioned blockchains built with products like Hyperledger Fabric require permission and proof of identity (e.g. KYC/AML) from their operators for participation.

From a programming enthusiast's perspective, there are some interesting technological aspects of blockchain that warrant a deep dive. Rather than only authoring smart contracts at the user-application layer, exploring how a blockchain's participating nodes compete to grow a decentralized immutable ledger via a consensus algorithm will give one a full picture of its technological merit. That was one of the motivations for me to develop a proof-of-concept crypto-mining blockchain system back in 2020.
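The core idea of a decentralized immutable ledger grown by mining can be captured in a few lines. Here's a hedged toy sketch in Python (not the proof-of-concept system mentioned above): blocks are chained by hashes, appended only after a brute-force Proof-of-Work search, and tampering with any earlier block invalidates every later link.

```python
# A toy sketch of the core blockchain idea: blocks chained by hashes,
# grown by brute-force Proof-of-Work mining. Purely illustrative.
import hashlib
import json

def block_hash(block):
    # canonical JSON so the hash is deterministic for a given block
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def mine(prev_hash, data, difficulty=2):
    """Find a nonce so the block hash starts with `difficulty` zeros."""
    nonce = 0
    while True:
        block = {"prev": prev_hash, "data": data, "nonce": nonce}
        h = block_hash(block)
        if h.startswith("0" * difficulty):
            return block, h
        nonce += 1

genesis, h0 = mine("0" * 64, "genesis")
block1, h1 = mine(h0, "alice pays bob 5")

# Each block commits to its predecessor's hash...
assert block1["prev"] == h0
# ...so tampering with an earlier block breaks every later link:
genesis["data"] = "genesis (tampered)"
assert block_hash(genesis) != h0   # the chain no longer verifies
```

Real systems add transactions, Merkle trees, difficulty adjustment and a network-wide longest/heaviest-chain rule on top of this skeleton, but the hash-linking and PoW search are the heart of it.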

When I last surveyed the blockchain landscape, the choice of a development platform was simple. That was about 3 years ago. Excluding permissioned-only blockchains, the only platform ready for "big time" development back then was Ethereum, with the scripting language Solidity and the framework Truffle for smart contracts. By "big time", I mean ready for the average programming enthusiast to participate in the ecosystem through available tech docs and self-learning. Much has evolved since.

Trending blockchain development platforms

There is now a big list of open-source blockchains besides Bitcoin and Ethereum — Solana, Cardano, Polkadot, Avalanche, Polygon, Algorand, Tezos, Flow, …, to name a few. From a technological perspective, what's more significant is the abundance of dApp (decentralized application) development platforms provided by many of these blockchains. These dApp platforms, built using contemporary programming languages like Go, Rust and Python, compete among themselves, widening choices and expediting improvements that ultimately benefit the rapidly growing dApp development community.

While there are relatively few business cases for building a custom blockchain system, a much greater demand has emerged over the past couple of years for developing smart contract applications, particularly in industries such as supply chain, DeFi, collectibles, arts, gaming and the metaverse. Many blockchain platforms provide programming SDKs (software development kits) for popular languages like JavaScript and Python, or DSLs (domain-specific languages) such as Solidity that many Ethereum dApp developers are already familiar with.

Meanwhile, frameworks for smart contract development that help boost developers' productivity have also prospered. Within the Ethereum ecosystem, the once predominant Truffle is now competing head to head with other frameworks like Hardhat.

Bitcoin versus Ethereum

Even though Bitcoin remains the blockchain network with the largest market cap, it primarily serves a single purpose as a decentralized cryptocurrency. It wasn't designed as a platform for development of business/financial applications like many other blockchain platforms were. Hence comparing Bitcoin with, say, Ethereum is like comparing apples to oranges. That said, Bitcoin's price apparently is still driving the ups and downs of almost every average blockchain network out there.

Despite all the relatively new blockchain development platforms that have emerged over the past few years, Ethereum remains the distinguished leader, with its TVL (total value locked) taking up well over half of the pie comprising all chains. To overcome the known scalability (and eco-friendliness) issues of its underlying PoW (Proof of Work) consensus, which result in slow transactions and high gas fees, Ethereum is undergoing a major transition to Ethereum 2.0 with a PoS (Proof of Stake) consensus that will supposedly allow it to cope with future growth.

Layer-2 blockchains on top of Ethereum

To circumvent Ethereum's existing scalability problem, various kinds of off-chain solutions have been put in place. They are oftentimes broadly referred to as layer-2 blockchain solutions, supplementing the underlying layer-1 blockchain (in this case Ethereum). The common goal of the layer-2 solutions is to alleviate load on the main chain (i.e. Ethereum) by delegating the bulk of compute-intensive tasks to auxiliary chains with more efficient ways of handling those tasks.

Here's a quick rundown of the most common types of layer-2 blockchain solutions:

Rollup – a kind of off-chain solution that executes transactions in rolled-up batches outside of the main chain while keeping transaction data and a proof mechanism on the chain. A zero-knowledge (ZK) rollup performs validity proofs, whereas an optimistic rollup assumes transactions are valid and runs fraud proofs upon challenge calls. For example, Loopring is a ZK rollup and Optimism is an optimistic rollup.

State Channel – another kind of solution that allows participants to bypass the node-validation process on the main chain by locking a certain portion of the state via a multi-signature smart contract, performing transactions off-chain, and unlocking the state with the appended state changes back on the main chain. Examples of state-channel operators are Raiden and Connext.

Plasma – a "nested" blockchain, with its own security separate from the main chain, that performs basic transactions and relies on fraud proofs upon validity challenges, e.g. the OMG Plasma project.

Sidechain – a relatively more independent off-chain setup with its own security, consensus mechanism and block structure. For example, Polygon is a popular layer-2 blockchain in this category.

For more details, visit Ethereum's developer-site documentation on off-chain scaling.

Other layer-1 blockchains

Meanwhile, a number of layer-1 blockchains such as Avalanche, Solana and Algorand have been steadily growing their market share (in terms of TVL). Many of these blockchains were built using leading-edge tech stacks like Rust, Go and Haskell, with improved architectures and more efficient, scalable and eco-friendly consensus mechanisms. By offering the dApp development community cutting-edge technology and versatile tools/SDKs, these blockchain operators strategically grow their market share in the highly competitive space.

While "Ethereum killers" sounds like a baiting term, it's indeed possible for one or more of these blockchains to dethrone Ethereum before its v2.0 has a chance to succeed the underperforming v1.0 in mid/late 2022. With things evolving at a cut-throat pace in the blockchain world and Ethereum's upgrade taking considerable time, wouldn't it be logical to expect one of the newer blockchains with improved design to swiftly take over as the leader? Evidently, Ethereum's huge lead in market share (again, in terms of TVL) and early-adopter buy-in has helped secure its leader position. Perhaps more important are the inherent limitations imposed on any given blockchain, namely that improvements can't be made simultaneously in all aspects.

Trilemma vs CAP theorem

Somewhat analogous to the CAP theorem (consistency/availability/partition tolerance) for distributed systems, the blockchain trilemma claims that security, scalability and decentralization cannot be simultaneously maximized without certain trade-offs among them. For distributed systems, a decent degree of partition tolerance is a must. Thus they generally trade consistency for availability (e.g. Cassandra) or vice versa (e.g. HBase).

On the other hand, a blockchain by design should be highly decentralized. But it also must be highly secure, or else the immutability of the stored transactions can't be guaranteed. For a given blockchain, there is much less room for trade-off given that security and decentralization are integrally critical, in a way leaving scalability the de facto sacrificial lamb.

This is where the trilemma differs from CAP. Under the well-formulated CAP theorem, a suitable trade-off among the key requirements can result in a practical distributed system. For instance, Cassandra relaxes strict consistency for better availability and remains a widely adopted distributed storage system.

Given the axiomatic importance of security and decentralization in a blockchain, developers essentially have to forgo the trade-off approach and think outside the box on boosting scalability. Sharding helps address the issue to some extent. Some contemporary blockchains (e.g. Avalanche) boost scalability and operational efficiency by splitting governance, exchange and smart contract development into separate inter-related chains, each with an optimal consensus algorithm. Then there are always the various intermediary off-chain (layer-2) solutions described earlier.

What about layer-0?

While a layer-1 blockchain allows developers to create and run platform-specific dApps/smart contracts, a layer-0 blockchain network is one on which blockchain operators build independent blockchains with full sovereignty. To build a custom blockchain, developers may elect to repurpose an existing open-source layer-1 blockchain that has the specific features of interest. When high autonomy and custom features (e.g. a custom security mechanism) are required, it might make more sense to build it off a layer-0 network, which provides an inter-network "backbone" with an underlying consensus algorithm, standard communication protocols among blockchains, and SDKs with comprehensive features.

One of the popular layer-0 networks is Cosmos, on top of which some high-profile blockchains like Binance Chain were built. Another layer-0 network is Polkadot, which offers a "centralized" security model, in contrast to Cosmos leaving the responsibility of security setup to individual blockchain operators. For core feature comparisons between the two networks, here's a nice blog post.

Blockchain oracles

Then there are blockchain projects whose main role is to connect a blockchain with data sources from the outside world. A blockchain can be viewed as an autonomous ecosystem operating in an isolated environment with its own governance model. Such isolation is by design, disallowing on-chain smart contracts from arbitrarily referencing external data, like currency rates or IoT sensor data, that can be prone to fraud.

An oracle provides a "gateway" for the smart contracts on a blockchain to connect to off-chain data sources in the real world. To ensure trustlessness, decentralized oracles have emerged, following the very principle embraced by general-purpose blockchains. Examples of such oracles are Chainlink and Band Protocol.

NFT – provenance of authenticity

NFT, short for non-fungible token, has taken the world by storm. It's riding on the blockchain trend at an early stage, when a majority of the general public is still wondering what to make of the phenomenon. At present, the most popular use case of NFT is probably provenance of authenticity. By programmatically binding a given asset (e.g. a painting) to a unique digital token consisting of pointers to unchangeable associated transactions (e.g. transfers of ownership) on a blockchain, the NFT essentially represents the digital receipt of the asset.

A majority of the NFTs seen today are associated with digital assets such as digital arts. A similar provenance mechanism can also be extended to cover physical assets like collectible arts by having the corresponding NFTs referencing immutable "digital specs" of the physical items. The "digital specs" of a given physical item could be a combination of the unique ID of the engraved tag, detailed imagery, etc.
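Such a "digital spec" can be reduced to a single deterministic fingerprint. Here's a hedged Python sketch (field names are illustrative, not a standard) that combines a tag ID and image bytes into a canonical hash an NFT could reference:

```python
# Sketch: deriving an immutable "digital spec" fingerprint for a physical
# item from its engraved tag ID plus detailed imagery, so that an NFT can
# reference the physical asset. Field names are illustrative.
import hashlib
import json

def digital_spec_hash(tag_id, image_bytes, extra=None):
    spec = {
        "tag_id": tag_id,
        "image_sha256": hashlib.sha256(image_bytes).hexdigest(),
        "extra": extra or {},
    }
    # canonical JSON so the same spec always hashes to the same value
    canonical = json.dumps(spec, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

h1 = digital_spec_hash("TAG-0001", b"raw image bytes")
h2 = digital_spec_hash("TAG-0001", b"raw image bytes")
h3 = digital_spec_hash("TAG-0002", b"raw image bytes")
assert h1 == h2   # deterministic
assert h1 != h3   # any change yields a different fingerprint
```

Because the fingerprint is deterministic, anyone holding the physical item's tag ID and imagery can recompute it and check it against the value referenced by the NFT on-chain.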

Given the prominence of Ethereum, most NFTs follow its standards, such as the ERC-721 standard, which requires a "smart contract" that programmatically describes the key attributes of the token and implements how the token can be transferred among accounts. Once a smart contract is deployed on a blockchain, it cannot be changed. Neither can any results from executing the contract once they are stored in validated blocks.
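The essence of what an ERC-721 contract tracks — unique token IDs, a single owner per token, and an append-only transfer history — can be sketched in a few lines. The Python below is a hedged illustration of the idea (hypothetical names; the real standard is a Solidity interface with events, approvals and more):

```python
# A minimal sketch of the ERC-721 idea: each token ID is unique, maps to
# exactly one owner, and every transfer is recorded in an append-only log.
import hashlib

class TinyNFTRegistry:
    def __init__(self):
        self.owner_of = {}   # token_id -> current owner
        self.history = []    # append-only mint/transfer log

    def mint(self, to, asset_metadata):
        # derive a unique token ID from the asset's metadata
        token_id = hashlib.sha256(asset_metadata.encode()).hexdigest()[:16]
        if token_id in self.owner_of:
            raise ValueError("token already minted")
        self.owner_of[token_id] = to
        self.history.append(("mint", None, to, token_id))
        return token_id

    def transfer(self, frm, to, token_id):
        if self.owner_of.get(token_id) != frm:
            raise ValueError("not the owner")  # mimics ERC-721's ownership check
        self.owner_of[token_id] = to
        self.history.append(("transfer", frm, to, token_id))

registry = TinyNFTRegistry()
tid = registry.mint("alice", "asset:painting-001")
registry.transfer("alice", "bob", tid)
print(registry.owner_of[tid])   # bob
print(len(registry.history))    # 2
```

On a real chain, the `history` log corresponds to transactions stored in validated blocks, which is what makes the provenance trail immutable.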

Outside of the Ethereum arena, some blockchain projects come up with their own native NFT standards. For example, Algorand's NFT is treated as one of its own built-in asset types and can be created by parametrically specifying certain attributes in the metadata of the digital asset, without the need for an explicit smart contract. There are also blockchains, like Avalanche, that provide both their native NFTs and Ethereum-compatible ones.

Final Thoughts

Over the years, the various blockchains that have emerged have interwoven into an inter-network that is increasingly structured and functionally rich. The blockchain-powered inter-network is being touted by many early players as the next generation of the internet and is sometimes referred to as Web3. Layer-0 blockchain projects aim precisely to be inter-networks of independent blockchains. Among other blockchain projects, some aim to create a blockchain-powered internet extension (e.g. the Internet Computer).

From a technological point of view, there are definite merits to using blockchain's underlying technology. In particular, the trustless decentralized ledger on a blockchain that keeps an immutable log of transactions can be used in many business use cases in various industry sectors like asset management. Some real-world applications using permissioned blockchains have demonstrated success in, for instance, improving traceability and transparency in the supply chain industry. The permission and proof-of-identity requirement does make fraud attempts significantly harder. Nonetheless, that inevitably breaks trustlessness, although such a requirement is generally acceptable in operating a private alliance network.

Viewing from a different angle, cryptocurrencies, except for stablecoins like Tether, are volatile and will likely remain so in the foreseeable future. Given that most prominent public blockchains are operated in conjunction with a corresponding cryptocurrency, when evaluating a blockchain platform to land on, it's worth taking a good look at its cryptocurrency counterpart. Even though volatility may not be as critical for the purpose of development as for investment, it does make assessing a given blockchain's long-term prospects non-trivial.

Another obstacle to mainstream adoption of public blockchains is that many blockchain projects have emerged with superficial or even downright fraudulent plans for opportunistic reward in the prospering space. And the NFT frenzy, permeated with scams, hasn't helped gain public trust either. In addition, people are overwhelmed by the numerous blockchain projects out there. As of this post, there are thousands of public blockchain projects, over 80 of which each have a corresponding cryptocurrency market cap above $1 billion. Likely only a small percentage of them will prevail when the dust settles.

All those issues are in some way hindering blockchain technology from becoming a lasting mainstream class of computing. On the other hand, for those who are adventurous and determined to make the most of the yet-to-mature but improving technology, now may be the right time to dive in.



Ethereum-compatible NFT On Avalanche

While blockchain has been steadily gaining attention from the general public over the past couple of years, it's NFT, short for non-fungible token, that has recently taken center stage. In particular, NFT shines in the area of provenance of authenticity. By programmatically binding a given asset to a unique digital token referencing immutable associated transactions on a blockchain, the NFT essentially serves as the "digital receipt" of the asset.

Currently Ethereum is undergoing a major upgrade to cope with future growth of the platform, which has been suffering from low transaction rates and high gas fees due to its existing, unscalable Proof of Work consensus algorithm. As described in a previous blockchain overview blog post, off-chain solutions, including bridging the Ethereum main chain with layer-2 subchains such as Polygon, help circumvent the performance issue.

Avalanche

Some layer-1 blockchains support Ethereum's NFT standards (e.g. ERC-721, ERC-1155) in addition to providing their own native NFT specs. Among them is Avalanche, which has been steadily growing its market share (in terms of TVL), trailing behind only a couple of prominent layer-1 blockchains such as Solana and Cardano.

With separation of concerns (SoC) as one of its underlying design principles, Avalanche uses a subnet model in which validators on a subnet only operate on the specific blockchains of their interest. Also in line with the SoC design principle, Avalanche comes with 3 built-in blockchains, each of which serves specific purposes with its own set of APIs:

  • Exchange Chain (X-Chain) – for creation & exchange of digital smart assets (including the native token AVAX), which are bound to programmatic governance rules
  • Platform Chain (P-Chain) – for creating & tracking subnets, each comprising a dynamic group of stakeholders responsible for consensually validating the blockchains of interest
  • Contract Chain (C-Chain) – for developing smart contract applications

NFT on Avalanche

Avalanche allows creation of native NFTs as a kind of its smart digital assets. Its website provides tutorials for creating such NFTs using its Go-based AvalancheGo API. But perhaps it's the support of Ethereum-compatible NFT standards, at a much higher transaction rate and lower cost than the existing Ethereum mainnet, that helps popularize the platform.

In this blog post, we're going to create ERC-721-compliant NFTs on the Avalanche platform, which require programmatic implementation of their sale/transfer terms in smart contracts. C-Chain is therefore the target blockchain. And rather than deploying our NFTs on the Avalanche mainnet, we'll use the Avalanche Fuji Testnet, which allows developers to pay for transactions with test-only AVAX tokens freely available from a designated crypto faucet.

Scaffold-ETH: an Ethereum development stack

Scaffold-ETH, a code repository of comprehensive Ethereum-based blockchain functions, offers a suite of tech stacks well suited to fast prototyping, along with sample code for various decentralized-application use cases. The stack includes Solidity, Hardhat, Ethers.js and ReactJS.

The following software is required for installing Scaffold-ETH and for building and deploying the NFT smart contracts:

Launching NFTs on Avalanche using a customized Scaffold-ETH

For the impatient, the revised code repo is at this GitHub link. Key changes made to the original branch in Scaffold-ETH will be highlighted at the bottom of this post.

To get a copy of Scaffold-ETH repurposed for NFTs on Avalanche, first git-clone the repo:

Next, open up a couple of shell command terminals and navigate to the project root (e.g. avax-scaffold-eth-nft).

Step 1: From the 1st shell terminal, install the necessary dependent modules.

Step 2: From the 2nd terminal, specify an account as the deployer.

Choose an account that owns some AVAX tokens (otherwise, get free tokens from an AVAX faucet) on the Avalanche Fuji testnet and create file packages/hardhat/mnemonic.txt with the account's 12-word mnemonic in it.

For future reference, the "deployed at" smart contract address should be saved. Transactions involving the smart contract can be reviewed at snowtrace.io.

Step 3: Back in the 1st terminal, start the Node.js server on port 3000.

This will spawn a web page in the default browser (which should have the MetaMask extension installed).

Step 4: From the web browser, connect to the MetaMask account which will receive the NFTs.

Step 5: Back in the 2nd terminal, mint the NFTs.

You will be prompted for the address of the NFT recipient account connected to the browser app. Upon successful minting, images of the NFTs should automatically be displayed on the web page.

To transfer any of the NFTs to another account, enter the address of the account to transfer to and click "transfer". Note that the account connected to the browser app needs to own some AVAX tokens (again, if not, get free tokens from an AVAX faucet).

Upon successful minting, the web page should look like the one below:

Avalanche NFTs using Scaffold-ETH (MetaMask connected)

Key changes made to the original Scaffold-ETH branch

It should be noted that Scaffold-ETH is a popular code repo under active development. The branch I had experimented with a few months ago is already markedly different from the same branch I git-cloned for custom modification. That prompted me to clone a separate repo to serve as a "snapshot" of the branch, rather than just showing my modifications to an evolving code base.

Below are the main changes made to the Scaffold-ETH Simple NFT Example branch git-cloned on March 30:

Hardhat configuration script: packages/hardhat/hardhat.config.js

The defaultNetwork value in the original Hardhat configuration script is "localhost" by default, assuming a local instance of a selected blockchain is in place. The following change sets the default network to the Fuji testnet, whose network configuration parameters need to be added as shown below.
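A sketch of the relevant portion of the configuration follows. The public Fuji C-Chain RPC URL and chain ID (43113) are real; the surrounding layout (including the `mnemonic()` helper that reads mnemonic.txt) follows the general Scaffold-ETH convention and may differ slightly from the actual revised repo.

```javascript
// packages/hardhat/hardhat.config.js -- illustrative excerpt, not the
// full file. The Fuji RPC URL and chainId 43113 are the real public
// testnet values; other details follow the Scaffold-ETH layout.
const defaultNetwork = "fujiAvalanche";

module.exports = {
  defaultNetwork,
  networks: {
    fujiAvalanche: {
      url: "https://api.avax-test.network/ext/bc/C/rpc",
      chainId: 43113,
      // assumes a mnemonic() helper that reads packages/hardhat/mnemonic.txt,
      // as in the original Scaffold-ETH config
      accounts: { mnemonic: mnemonic() },
    },
    // ...the other networks from the original config remain unchanged...
  },
};
```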

Note that with the explicit defaultNetwork value set to "fujiAvalanche", one can skip the --network fujiAvalanche command-line option in the smart contract deploy and mint commands.

ReactJS main app: packages/react-app/src/App.jsx

To avoid a compilation error, the following imports need to be moved above the variable-declaration section in the main app (App.jsx).

Minting script: packages/hardhat/scripts/mint.js

A few notes:

  • The square-shaped animal icon images for the NFTs used in the minting script are from public-domain sources. Here's the link to the author's website.
  • The Node module prompt-sync is used (and thus added to the main package.json dependency list) to avoid having to hardcode the NFT recipient address in the minting script.
  • The code below makes the variable toAddress a dynamic input value and replaces the original NFT images with the square-styled images, along with a modularized mintItem function.



Algorand NFTs With IPFS Assets

As of this writing, the overall cryptocurrency sector is undergoing a major downturn in which about two-third of the average blockchain’s market cap has evaporated since early April. This might not be the best time to try make anyone excited about launching NFTs on any blockchain. Nevertheless, volatility of cryptocurrencies has always been a known phenomenon. From a technological perspective, it’s much less of a concern than how well-designed the underlying blockchain is in terms of security, decentralization and scalability.

Many popular blockchain platforms out there are offering their own open-standard crypto tokens, but so far the most prominent crypto tokens, fungible or non-fungible, are Ethereum-based. For instance, the NFTs we minted on Avalanche‘s Fuji testnet in a previous blog post are Ethereum-based and ERC-721 compliant.

Given Ethereum’s large market share, dApp developers generally prefer having their NFTs transact on the Ethereum main chain or Ethereum-compatible chains like Polygon and Avalanche’s C-Chain, yet many others opt for incompatible chains as their NFT development/trading platforms.

Choosing a blockchain for NFT development

Use cases of NFTs are mostly about provenance of authenticity. To ensure that NFT-related transactions are verifiable at any point in time, the perpetual availability of the blockchain in which the transaction history resides is critical. Though infrequent, hard forks do happen, and can leave historical transactions stranded on a branch forked off the main chain. That’s highly undesirable. Some blockchains such as Algorand and Polkadot tout that their chains are by design “forkless”, which does provide an edge over competitors.

Another factor critical for transacting NFTs on a blockchain is low latency in reaching consensus to finalize transactions; that latency is generally referred to as finality. Given that auctions, a common venue for trading NFTs, are time-sensitive events, a long delay is obviously ill-suited. Chains like Avalanche and Algorand are able to keep finality under 5 seconds, much shorter than the 10+ minutes required on other blockchains like Cardano or Ethereum.

Algorand and IPFS

Launched 3 years ago in June 2019, Algorand is a layer-1 blockchain with a focus on being highly decentralized, scalable and secure with low transaction costs. Its low latency (i.e. fast finality), along with a design emphasis on running the blockchain with a forkless operational model, makes it an appealing chain for transacting and securing NFTs.

In this blog post, we’re going to create a simple dApp in JavaScript to mint an NFT on the Algorand Testnet for a digital asset pinned to IPFS. IPFS, short for InterPlanetary File System, is a decentralized peer-to-peer network that aims to serve as a single resilient global network for storing and sharing files.

Algorand NFT

Algorand comes with its standard asset class, ASA (Algorand Standard Assets), which can be used to represent a variety of customizable digital assets. A typical NFT can be defined as an ASA purely through configuration, by assigning the asset parameters values that conform to Algorand’s NFT specifications.
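As a purely hypothetical illustration (plain objects, not actual SDK calls), an ARC-3 NFT boils down to an ASA parameter set along these lines, where total: 1 and decimals: 0 make the asset one-of-a-kind and indivisible:

```javascript
// Hypothetical ASA parameter set for an ARC-3 NFT (illustration only;
// field names mirror Algorand's asset-creation parameters, values are made up).
const nftAsaParams = {
  total: 1,                                // exactly one unit exists ...
  decimals: 0,                             // ... and it cannot be subdivided
  defaultFrozen: false,
  unitName: 'nft',
  assetName: 'Ninja Smiley@arc3',          // '@arc3' suffix signals ARC-3 compliance
  assetURL: 'ipfs://<metadata-CID>#arc3',  // points at the JSON metadata pinned to IPFS
};

// Sanity check: an ASA only qualifies as an NFT when it is unique and indivisible.
const isNft = (p) => p.total === 1 && p.decimals === 0;
console.log(isNft(nftAsaParams)); // true
```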

The two common Algorand specs for NFTs are ARC-3 and ARC-69. A main difference between the two specs is that ARC-69 asset data is kept on-chain whereas ARC-3’s isn’t. We’ll be minting an ARC-3 NFT with the digital asset stored on IPFS.

In contrast to the previous NFT development exercise on Avalanche, which leveraged a rich-UI stack (i.e. Scaffold-ETH), this is going to be a barebones proof-of-concept with a core focus on how to programmatically create an NFT for a digital asset (a digital image) pinned to IPFS. No web frontend UI.

Algorand SDKs

Algorand offers SDKs for a few programming languages including JavaScript, Python, Go and Java. We’ll be using the JavaScript SDK. The official developer website provides tutorials for various dApp use cases, and one of them is exactly about launching an ARC-3 NFT for assets on IPFS. Unfortunately, even though the tutorial is only about 6 months old, it no longer works because some of its code is already obsolete.

It’s apparently a result of the rapidly evolving SDK code: a frustrating but common problem for developers, who have to constantly play catch-up with backward-incompatible APIs evolving at a breakneck pace. For example, retrieval of user-level information is no longer supported by the latest Algorand SDK client (algod client), but nowhere could I find any tutorials doing it otherwise. Presumably for scalability, it turns out such queries are now delegated to an Algorand indexer, which runs as an independent process backed by a PostgreSQL-compatible database.

Given the doubt about whether the demo code on the Algorand website is up to date, I find it more straightforward (though a little tedious) to pick up the how-to’s by digging directly into the SDK source code, js-algorand-sdk. For instance, one could quickly skim through the method signature and implementation logic of the algod client method pendingTransactionInformation in the corresponding source, or the indexer method lookupAccountByID in the indexer’s source, for their exact tech specs.

Create an NPM project with dependencies

Minimal requirements for this Algorand NFT development exercise include the following:

  • Node.js installed with NPM
  • An account at Pinata, an IPFS pinning service
  • An Algorand compatible crypto wallet (Pera is preferred for the availability of a “developer” mode)

First, create a subdirectory as the project root. For example:

$ mkdir ~/algorand/algo-nft-ipfs/
$ cd ~/algorand/algo-nft-ipfs/

Next, create dependency file package.json with content like below:

{
    "name": "algo-nft-ipfs",
    "version": "1.0.0",
    "description": "Algorand NFT with asset pinned to IPFS",
    "scripts": {
        "mint": "node algo-nft-ipfs.js"
    },
    "dependencies": {
        "@algonaut/algo-validation-agent": "latest",
        "@pinata/sdk": "latest",
        "algosdk": "latest",
        "bs58": "latest",
        "dotenv": "latest",
        "ipfs-core": "latest",
        "ipfs-http-client": "latest",
        "node-base64-image": "latest"
    },
    "devDependencies": {
        "nodemon": "latest",
        "parcel-bundler": "latest"
    },
    "keywords": []
}

Install the NPM package:

$ npm install

Create file “.env” for keeping private keys

Besides the SDKs for Pinata and Algorand, dotenv is also included in the package.json dependency file, allowing variables such as NFT recipient’s wallet mnemonic, algod client/indexer URLs (for Algorand Testnet) and Pinata API keys, to be kept in file .env in the filesystem.

mnemonic = "<my wallet's mnemonic>"
algodClientUrl = "https://node.testnet.algoexplorerapi.io"
algodClientPort = ""
algodClientToken = ""
indexerUrl = "https://algoindexer.testnet.algoexplorerapi.io"
indexerPort = ""
indexerToken = ""
pinataApiKey = "<my pinata api key>"
pinataApiSecret = "<my pinata api secret>"

Note that Algorand uses a 25-word mnemonic, with the 25th word being a checksum derived from the preceding 24 words (which draw from the BIP39 wordlist). Many blockchains simply use 12-word BIP39 mnemonics.
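Under the hood, dotenv does little more than parse KEY = "value" lines from the .env file into process.env. A minimal sketch of that idea, using only Node built-ins (this is not the real dotenv implementation):

```javascript
// Minimal .env-style parser sketch (illustration only; use the dotenv package in practice).
const parseEnv = (text) => {
  const vars = {};
  for (const line of text.split('\n')) {
    // Match lines like: key = "value"  (quotes optional; values must not contain quotes)
    const m = line.match(/^\s*([\w.]+)\s*=\s*"?([^"]*)"?\s*$/);
    if (m) vars[m[1]] = m[2];
  }
  return vars;
};

const sample = 'mnemonic = "abandon ability able"\nalgodClientPort = ""';
const env = parseEnv(sample);
console.log(env.mnemonic); // abandon ability able
```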

ARC-3 NFT metadata

Next, we create the file assetMetadata.js for storing the Algorand ARC-3-specific metadata template:

module.exports = {
    arc3MetadataJson: {
        "name": "",
        "description": "",
        "image": "ipfs://",
        "image_integrity": "sha256-",
        "image_mimetype": "",
        "external_url": "",
        "external_url_integrity": "",
        "external_url_mimetype": "",
        "animation_url": "",
        "animation_url_integrity": "sha256-",
        "animation_url_mimetype": "",
        "properties": {
            "file_url": "",
            "file_url_integrity": "",
            "file_url_mimetype": "",
        }
    }
}

Main application logic

The main function createNft does a few things:

  • Create an account object from the wallet mnemonic stored in file .env
  • Create a digital asset (an image) pinned to IPFS
  • Create an ARC-3 compliant NFT associated with the pinned asset

const createNft = async () => {
  try {
    let account = createAccount();

    console.log("Press any key when the account is funded ...");
    await keypress();

    const asset = await createAssetOnIpfs();

    const { assetID } = await createArc3Asset(asset, account);
  }
  catch (err) {
    console.log("err", err);
  };

  process.exit();
};

Account object creation

Function createAccount is self-explanatory. It retrieves the wallet mnemonic from the .env file, derives from it the secret key and wallet address (public key), and displays a reminder to make sure the account is funded with Algorand Testnet tokens to cover transaction fees.

const createAccount = () => {
  try {
    const mnemonic = process.env.mnemonic
    const account = algosdk.mnemonicToSecretKey(mnemonic);

    console.log("Derived account address = " + account.addr);
    console.log("To add funds to the account, visit https://dispenser.testnet.aws.algodev.network/?account=" + account.addr);

    return account;
  }
  catch (err) {
    console.log("err", err);
  }
};

Pinning a digital asset to IPFS

Function createAssetOnIpfs is responsible for creating and pinning a digital asset to IPFS. This is where the asset attributes, including the source file path, description and MIME type (e.g. image/png, video/mp4), are provided. In this example, a ninja smiley JPEG under the project root is used as a placeholder. Simply substitute it with your favorite image, video, etc.

const createAssetOnIpfs = async () => {
  return await pinata.testAuthentication().then((res) => {
    console.log('Pinata test authentication: ', res);
    return assetPinnedToIpfs(
      'smiley-ninja-896x896.jpg',
      'image/jpeg',
      'Ninja Smiley',
      'Ninja Smiley 896x896 JPEG image pinned to IPFS'
    );
  }).catch((err) => {
    return console.log(err);
  });
}

The actual pinning work is performed by the function assetPinnedToIpfs, which returns the identifying IPFS URL for the digital asset’s metadata.

const assetPinnedToIpfs = async (nftFilePath, mimeType, assetName, assetDesc) => {
  const nftFile = fs.createReadStream(nftFilePath);
  ...

  const pinMeta = {
    pinataMetadata: {
      name: assetName,
      ...
    },
    pinataOptions: {
      cidVersion: 0
    }
  };

  const resultFile = await pinata.pinFileToIPFS(nftFile, pinMeta);

  let metadata = assetMetadata.arc3MetadataJson;

  const integrity = ipfsHash(resultFile.IpfsHash);

  metadata.name = `${assetName}@arc3`;
  metadata.description = assetDesc;
  metadata.image = `ipfs://${resultFile.IpfsHash}`;
  metadata.image_integrity = `${integrity.cidBase64}`;
  metadata.image_mimetype = mimeType;
  metadata.properties = ...
  ...

  const resultMeta = await pinata.pinJSONToIPFS(metadata, pinMeta);
  const metaIntegrity = ipfsHash(resultMeta.IpfsHash);

  return {
    name: `${assetName}@arc3`,
    url: `ipfs://${resultMeta.IpfsHash}`,
    metadata: metaIntegrity.cidUint8Arr,
    integrity: metaIntegrity.cidBase64
  };
};
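The ipfsHash helper called above (defined in the full script further down) exploits the structure of a version-0 IPFS CID: the CID is the base58 encoding of two prefix bytes (0x12 for sha2-256, 0x20 for a 32-byte digest length) followed by the raw SHA-256 digest, which is what gets passed along as the asset metadata hash. A self-contained sketch of that decoding, with base58 implemented inline instead of via the bs58 package:

```javascript
// Decode a base58 string into bytes using BigInt arithmetic
// (a stand-in for the bs58 package; illustration only).
const ALPHABET = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz';

const base58Decode = (str) => {
  let num = 0n;
  for (const ch of str) {
    const idx = ALPHABET.indexOf(ch);
    if (idx < 0) throw new Error(`invalid base58 character: ${ch}`);
    num = num * 58n + BigInt(idx);
  }
  const bytes = [];
  while (num > 0n) {
    bytes.unshift(Number(num % 256n));
    num /= 256n;
  }
  // Each leading '1' in base58 represents a leading zero byte.
  for (const ch of str) {
    if (ch !== '1') break;
    bytes.unshift(0);
  }
  return Uint8Array.from(bytes);
};

// A CIDv0 decodes to <0x12 sha2-256> <0x20 length-32> <32-byte digest>;
// slicing off the 2-byte prefix yields the raw digest.
const cid = 'QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG'; // a well-known public CID
const decoded = base58Decode(cid);
console.log(decoded.length, decoded[0].toString(16), decoded[1].toString(16)); // 34 12 20
```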

Creating an Algorand ARC-3 NFT

Function createNft is just an interfacing wrapper; it’s the createArc3Asset function that does the actual work of initiating and signing the transactions on the Algorand Testnet, via the algod client, to create the ARC-3 NFT associated with the pinned asset.

const createArc3Asset = async (asset, account) => {
  (async () => {
    let acct = await indexerClient.lookupAccountByID(account.addr).do();
    console.log("Account Address: " + acct['account']['address']);
    ...
  })().catch(e => {
    console.error(e);
    console.trace();
  });

  const txParams = await algodClient.getTransactionParams().do();

  const txn = algosdk.makeAssetCreateTxnWithSuggestedParamsFromObject({
    from: account.addr,
    total: 1,
    decimals: 0,
    ...
    unitName: 'nft',
    assetName: asset.name,
    assetURL: asset.url,
    assetMetadataHash: new Uint8Array(asset.metadata),
    suggestedParams: txParams
  });

  const rawSignedTxn = txn.signTxn(account.sk);
  const tx = await algodClient.sendRawTransaction(rawSignedTxn).do();

  const confirmedTxn = await waitForConfirmation(tx.txId);
  const txInfo = await algodClient.pendingTransactionInformation(tx.txId).do();

  const assetID = txInfo["asset-index"];

  console.log('Account ', account.addr, ' has created ARC3 compliant NFT with asset ID', assetID);
  console.log(`Check it out at https://testnet.algoexplorer.io/asset/${assetID}`);

  return { assetID };
}

Note that the parameter values total: 1 and decimals: 0 ensure the uniqueness and indivisibility of the NFT. Also worth noting is that the Algorand SDK provides a waitForConfirmation() function for awaiting transaction confirmation for a specified number of rounds:

const confirmedTxn = await algosdk.waitForConfirmation(algodClient, txId, waitRounds);

However, for some unknown reason, it doesn’t seem to work with a fixed waitRounds, so a custom function seen in a demo app on Algorand’s website is used instead. The custom code simply checks the value returned by the algod client method pendingTransactionInformation(txId) in a loop.
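Stripped of the SDK specifics, that custom logic is essentially a poll-until-confirmed loop. A sketch of the pattern (the mock client below is made up to keep the example runnable without a network):

```javascript
// Generic poll-until-confirmed pattern (SDK-free sketch).
// pollFn resolves to an object with a "confirmed-round" field, mirroring
// the shape of algod's pendingTransactionInformation response.
const pollUntilConfirmed = async (pollFn, maxAttempts = 10) => {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const info = await pollFn();
    if (info['confirmed-round'] > 0) return info; // confirmed: done
    // In the real code, this is where statusAfterBlock() waits for the next round.
  }
  throw new Error(`not confirmed after ${maxAttempts} attempts`);
};

// Mock client: reports the transaction as confirmed on the third poll.
let calls = 0;
const mockPoll = async () => (++calls < 3 ? { 'confirmed-round': 0 } : { 'confirmed-round': 42 });

pollUntilConfirmed(mockPoll).then((info) => {
  console.log(info['confirmed-round']); // 42
});
```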

Putting everything together

For simplicity, the above code logic is all put in a single JavaScript script algo-nft-ipfs.js under the project root.

const fs = require('fs');
const path = require('path');
const algosdk = require('algosdk');
const bs58 = require('bs58');
require('dotenv').config()

const assetMetadata = require('./assetMetadata');

const algodClient = new algosdk.Algodv2(
  process.env.algodClientToken,
  process.env.algodClientUrl,
  process.env.algodClientPort
);

const indexerClient = new algosdk.Indexer(
  process.env.indexerToken,
  process.env.indexerUrl,
  process.env.indexerPort
);

const pinataApiKey = process.env.pinataApiKey;
const pinataApiSecret = process.env.pinataApiSecret;
const pinataSdk = require('@pinata/sdk');
const pinata = pinataSdk(pinataApiKey, pinataApiSecret);

const keypress = async () => {
  process.stdin.setRawMode(true);
  return new Promise(resolve => process.stdin.once('data', () => {
    process.stdin.setRawMode(false)
    resolve()
  }));
};

const waitForConfirmation = async (txId) => {
  const status = await algodClient.status().do();
  let lastRound = status["last-round"];
  let txInfo = null;

  while (true) {
    txInfo = await algodClient.pendingTransactionInformation(txId).do();
    if (txInfo["confirmed-round"] !== null && txInfo["confirmed-round"] > 0) {
      console.log("Transaction " + txId + " confirmed in round " + txInfo["confirmed-round"]);
      break;
    }
    lastRound++;
    await algodClient.statusAfterBlock(lastRound).do();
  }

  return txInfo;
}

const createAccount = () => {
  try {
    const mnemonic = process.env.mnemonic
    const account = algosdk.mnemonicToSecretKey(mnemonic);

    console.log("Derived account address = " + account.addr);
    console.log("To add funds to the account, visit https://dispenser.testnet.aws.algodev.network/?account=" + account.addr);

    return account;
  }
  catch (err) {
    console.log("err", err);
  }
};

const ipfsHash = (cid) => {
  const cidUint8Arr = bs58.decode(cid).slice(2);
  const cidBase64 = cidUint8Arr.toString('base64');
  return { cidUint8Arr, cidBase64 };
};

const assetPinnedToIpfs = async (nftFilePath, mimeType, assetName, assetDesc) => {
  const nftFile = fs.createReadStream(nftFilePath);
  const nftFileName = nftFilePath.split('/').pop();

  const properties = {
    "file_url": nftFileName,
    "file_url_integrity": "",
    "file_url_mimetype": mimeType
  };

  const pinMeta = {
    pinataMetadata: {
      name: assetName,
      keyvalues: {
        "url": nftFileName,
        "mimetype": mimeType
      }
    },
    pinataOptions: {
      cidVersion: 0
    }
  };

  const resultFile = await pinata.pinFileToIPFS(nftFile, pinMeta);
  console.log('Asset pinned to IPFS via Pinata: ', resultFile);

  let metadata = assetMetadata.arc3MetadataJson;
  const integrity = ipfsHash(resultFile.IpfsHash);

  metadata.name = `${assetName}@arc3`;
  metadata.description = assetDesc;
  metadata.image = `ipfs://${resultFile.IpfsHash}`;
  metadata.image_integrity = `${integrity.cidBase64}`;
  metadata.image_mimetype = mimeType;
  metadata.properties = properties;
  metadata.properties.file_url = `https://ipfs.io/ipfs/${resultFile.IpfsHash}`;
  metadata.properties.file_url_integrity = `${integrity.cidBase64}`;

  console.log('Algorand NFT-IPFS metadata: ', metadata);

  const resultMeta = await pinata.pinJSONToIPFS(metadata, pinMeta);
  const metaIntegrity = ipfsHash(resultMeta.IpfsHash);
  console.log('Asset metadata pinned to IPFS via Pinata: ', resultMeta);

  return {
    name: `${assetName}@arc3`,
    url: `ipfs://${resultMeta.IpfsHash}`,
    metadata: metaIntegrity.cidUint8Arr,
    integrity: metaIntegrity.cidBase64
  };
};

const createAssetOnIpfs = async () => {
  return await pinata.testAuthentication().then((res) => {
    console.log('Pinata test authentication: ', res);
    return assetPinnedToIpfs(
      'smiley-ninja-896x896.jpg',
      'image/jpeg',
      'Ninja Smiley',
      'Ninja Smiley 896x896 JPEG image pinned to IPFS'
    );
  }).catch((err) => {
    return console.log(err);
  });
}

const createArc3Asset = async (asset, account) => {
  (async () => {
    let acct = await indexerClient.lookupAccountByID(account.addr).do();
    console.log("Account Address: " + acct['account']['address']);
    console.log(" Amount: " + acct['account']['amount']);
    console.log(" Rewards: " + acct['account']['rewards']);
    console.log(" Created Assets: " + acct['account']['total-created-assets']);
    console.log(" Current Round: " + acct['current-round']);
  })().catch(e => {
    console.error(e);
    console.trace();
  });

  const txParams = await algodClient.getTransactionParams().do();

  const txn = algosdk.makeAssetCreateTxnWithSuggestedParamsFromObject({
    from: account.addr,
    total: 1,
    decimals: 0,
    defaultFrozen: false,
    manager: account.addr,
    reserve: undefined,
    freeze: undefined,
    clawback: undefined,
    unitName: 'nft',
    assetName: asset.name,
    assetURL: asset.url,
    assetMetadataHash: new Uint8Array(asset.metadata),
    suggestedParams: txParams
  });

  const rawSignedTxn = txn.signTxn(account.sk);
  const tx = await algodClient.sendRawTransaction(rawSignedTxn).do();

  // const confirmedTxn = await algosdk.waitForConfirmation(algodClient, tx, 4);
  // /* Error: Transaction not confirmed after 4 rounds */
  const confirmedTxn = await waitForConfirmation(tx.txId);
  const txInfo = await algodClient.pendingTransactionInformation(tx.txId).do();

  const assetID = txInfo["asset-index"];

  console.log('Account ', account.addr, ' has created ARC3 compliant NFT with asset ID', assetID);
  console.log(`Check it out at https://testnet.algoexplorer.io/asset/${assetID}`);

  return { assetID };
}

const createNft = async () => {
  try {
    let account = createAccount();

    console.log("Press any key when the account is funded ...");
    await keypress();

    const asset = await createAssetOnIpfs();

    const { assetID } = await createArc3Asset(asset, account);
  }
  catch (err) {
    console.log("err", err);
  };

  process.exit();
};

createNft();
const fs = require('fs');
const path = require('path');
const algosdk = require('algosdk');
const bs58 = require('bs58');

require('dotenv').config()
const assetMetadata = require('./assetMetadata');

const algodClient = new algosdk.Algodv2(
  process.env.algodClientToken,
  process.env.algodClientUrl,
  process.env.algodClientPort
);

const indexerClient = new algosdk.Indexer(
  process.env.indexerToken,
  process.env.indexerUrl,
  process.env.indexerPort
);

const pinataApiKey = process.env.pinataApiKey;
const pinataApiSecret = process.env.pinataApiSecret;
const pinataSdk = require('@pinata/sdk');
const pinata = pinataSdk(pinataApiKey, pinataApiSecret);

const keypress = async () => {
  process.stdin.setRawMode(true);
  return new Promise(resolve => process.stdin.once('data', () => {
    process.stdin.setRawMode(false)
    resolve()
  }));
};

const waitForConfirmation = async (txId) => {
  const status = await algodClient.status().do();
  let lastRound = status["last-round"];
  let txInfo = null;

  while (true) {
    txInfo = await algodClient.pendingTransactionInformation(txId).do();
    if (txInfo["confirmed-round"] !== null && txInfo["confirmed-round"] > 0) {
      console.log("Transaction " + txId + " confirmed in round " + txInfo["confirmed-round"]);
      break;
    }
    lastRound ++;
    await algodClient.statusAfterBlock(lastRound).do();
  }

  return txInfo;
}

const createAccount = () => {
  try {
    const mnemonic = process.env.mnemonic
    const account = algosdk.mnemonicToSecretKey(mnemonic);

    console.log("Derived account address = " + account.addr);
    console.log("To add funds to the account, visit https://dispenser.testnet.aws.algodev.network/?account=" + account.addr);

    return account;
  }
  catch (err) {
    console.log("err", err);
  }
};

const ipfsHash = (cid) => {
  // Drop the 2-byte multihash prefix (0x12 0x20) from the decoded CIDv0
  const cidUint8Arr = bs58.decode(cid).slice(2);
  // Wrap in a Buffer: a plain Uint8Array's toString() ignores the
  // 'base64' argument, so this works regardless of what bs58 returns
  const cidBase64 = Buffer.from(cidUint8Arr).toString('base64');
  return { cidUint8Arr, cidBase64 };
};

const assetPinnedToIpfs = async (nftFilePath, mimeType, assetName, assetDesc) => {
  const nftFile = fs.createReadStream(nftFilePath);
  const nftFileName = nftFilePath.split('/').pop();
  
  const properties = {
    "file_url": nftFileName,
    "file_url_integrity": "",
    "file_url_mimetype": mimeType
  };

  const pinMeta = {
    pinataMetadata: {
      name: assetName,
      keyvalues: {
        "url": nftFileName,
        "mimetype": mimeType
      }
    },
    pinataOptions: {
      cidVersion: 0
    }
  };

  const resultFile = await pinata.pinFileToIPFS(nftFile, pinMeta);
  console.log('Asset pinned to IPFS via Pinata: ', resultFile);

  let metadata = assetMetadata.arc3MetadataJson;

  const integrity = ipfsHash(resultFile.IpfsHash);

  metadata.name = `${assetName}@arc3`;
  metadata.description = assetDesc;
  metadata.image = `ipfs://${resultFile.IpfsHash}`;
  metadata.image_integrity = `${integrity.cidBase64}`;
  metadata.image_mimetype = mimeType;
  metadata.properties = properties;
  metadata.properties.file_url = `https://ipfs.io/ipfs/${resultFile.IpfsHash}`;
  metadata.properties.file_url_integrity = `${integrity.cidBase64}`;

  console.log('Algorand NFT-IPFS metadata: ', metadata);

  const resultMeta = await pinata.pinJSONToIPFS(metadata, pinMeta);
  const metaIntegrity = ipfsHash(resultMeta.IpfsHash);
  console.log('Asset metadata pinned to IPFS via Pinata: ', resultMeta);

  return {
    name: `${assetName}@arc3`,
    url: `ipfs://${resultMeta.IpfsHash}`,
    metadata: metaIntegrity.cidUint8Arr,
    integrity: metaIntegrity.cidBase64
  };
};

const createAssetOnIpfs = async () => {
  return await pinata.testAuthentication().then((res) => {
    console.log('Pinata test authentication: ', res);
    return assetPinnedToIpfs(
      'smiley-ninja-896x896.jpg',
      'image/jpeg',
      'Ninja Smiley',
      'Ninja Smiley 896x896 JPEG image pinned to IPFS'
    );
  }).catch((err) => {
    return console.log(err);
  });
}

const createArc3Asset = async (asset, account) => {
  (async () => {
    let acct = await indexerClient.lookupAccountByID(account.addr).do();
    console.log("Account Address: " + acct['account']['address']);
    console.log("         Amount: " + acct['account']['amount']);
    console.log("        Rewards: " + acct['account']['rewards']);
    console.log(" Created Assets: " + acct['account']['total-created-assets']);
    console.log("  Current Round: " + acct['current-round']);
  })().catch(e => {
    console.error(e);
    console.trace();
  });

  const txParams = await algodClient.getTransactionParams().do();

  const txn = algosdk.makeAssetCreateTxnWithSuggestedParamsFromObject({
    from: account.addr,
    total: 1,
    decimals: 0,
    defaultFrozen: false,
    manager: account.addr,
    reserve: undefined,
    freeze: undefined,
    clawback: undefined,
    unitName: 'nft',
    assetName: asset.name,
    assetURL: asset.url,
    assetMetadataHash: new Uint8Array(asset.metadata),
    suggestedParams: txParams
  });

  const rawSignedTxn = txn.signTxn(account.sk);
  const tx = await algodClient.sendRawTransaction(rawSignedTxn).do();

  // const confirmedTxn = await algosdk.waitForConfirmation(algodClient, tx, 4);
  // /* Error: Transaction not confirmed after 4 rounds */
  const confirmedTxn = await waitForConfirmation(tx.txId);
  const txInfo = await algodClient.pendingTransactionInformation(tx.txId).do();

  const assetID = txInfo["asset-index"];

  console.log('Account ', account.addr, ' has created ARC3 compliant NFT with asset ID', assetID);
  console.log(`Check it out at https://testnet.algoexplorer.io/asset/${assetID}`);

  return { assetID };
}

const createNft = async () => {
  try {
    let account = createAccount();

    console.log("Press any key when the account is funded ...");
    await keypress();

    const asset = await createAssetOnIpfs();

    const { assetID } = await createArc3Asset(asset, account);
  }
  catch (err) {
    console.log("err", err);
  };

  process.exit();
};

createNft();

Minting the Algorand NFT

To mint the ARC-3 compliant NFT:

$ npm run mint

Upon successful minting of the NFT, you’ll see messages similar to the following, with a pointer to where you can look up details about the NFT and its associated transactions on the Algorand Testnet:

Account <accountID> has created ARC3 compliant NFT with asset ID: <assetID>
Check it out at https://testnet.algoexplorer.io/asset/<assetID>

Verifying …

From the algoexplorer.io website, the URL of the IPFS metadata file for the NFT should look like the following:

ipfs://<ipfsMetadataHash>

Value <ipfsMetadataHash> should match the hash value of the pinned asset metadata file under your Pinata account and should be viewable at:

https://gateway.pinata.cloud/ipfs/<ipfsMetadataHash>
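As an aside on what the `ipfsHash` helper in the code above is doing: a CIDv0 string is the base58 encoding of a 34-byte sha2-256 multihash, so stripping the first two prefix bytes (0x12 for sha2-256, 0x20 for a 32-byte digest) leaves the raw digest used as the integrity value. A self-contained sketch, with an inline base58 decoder so it runs without the `bs58` package (the sample CID below is just an illustrative CIDv0, not one from this post):

```javascript
// Decode a base58btc string into bytes (what bs58.decode does)
const ALPHABET = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz';

function base58Decode(str) {
  let num = 0n;
  for (const ch of str) {
    const idx = ALPHABET.indexOf(ch);
    if (idx < 0) throw new Error(`invalid base58 character: ${ch}`);
    num = num * 58n + BigInt(idx);
  }
  const bytes = [];
  while (num > 0n) {
    bytes.unshift(Number(num % 256n));
    num /= 256n;
  }
  // Each leading '1' character encodes a leading zero byte
  for (const ch of str) {
    if (ch !== '1') break;
    bytes.unshift(0);
  }
  return Uint8Array.from(bytes);
}

// A sample CIDv0 (46 characters, always starting with "Qm")
const cid = 'QmYwAPJzv5CZsnA625s3Xf2nemtYgPpHdWEz79ojWnPbdG';
const multihash = base58Decode(cid);

// multihash = [0x12 (sha2-256), 0x20 (32-byte length), ...digest]
const digest = multihash.slice(2);
const integrity = Buffer.from(digest).toString('base64');

console.log(multihash.length);            // 34
console.log(multihash[0], multihash[1]);  // 18 32
console.log(integrity.length);            // 44
```

This is also why every CIDv0 starts with "Qm": those two characters are the base58 image of the fixed 0x12 0x20 prefix.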

There are a few Algorand-compatible wallets, with Pera being the one created by Algorand’s development team. For developers, Pera has a convenient option for switching between the Algorand Mainnet and Testnet. To verify receipt of the NFT, simply switch to the Algorand Testnet (under Settings > Developer Settings > Node Settings).

Here’s what the received NFT in a Pera wallet would look like:

Pera - Ninja Smiley ARC3 NFT

From within Pera Explorer:

Pera Explorer - Ninja Smiley ARC3

1 thought on “Algorand NFTs With IPFS Assets”

  1. Leo Cheung (post author), July 2, 2022

     LATE ADDENDUM: Source code illustrated in this blog post has been pushed to a GitHub repo.


A Crossword Puzzle In Scala

As a programming exercise, creating and solving a crossword puzzle is a fun one, requiring non-trivial but reasonable effort to craft the necessary programming logic.

The high-level requirements are simple — Given a square-shaped crossword board with intersecting word slots, and a set of words that are supposed to fit in the slots, solve the crossword and display the resulting word-populated board.

Data structure for the crossword board

First, the implementation can be made more readable with a couple of inter-related classes representing the crossword board’s layout:

Class Board represents the crossword board which can be constructed with the following parameters:

  • sz: Size of the square-shaped board with dimension sz x sz
  • bgCh: Background character representing a single cell of the board, which can be identified by its XY-coordinates (default ‘.’)
  • spCh: Space character representing a single empty cell of a word slot (default ‘*’)
  • slots: A set of word slots, each represented as an instance of class Slot
  • arr: The “hidden” character array of dimension sz x sz that stores the content (bgCh, spCh in empty slots, and words)

Class Slot represents an individual word slot with these parameters:

  • start: The XY-coordinates of the starting character of the word slot
  • horiz: A boolean indicating whether the word slot is horizontal (true) or vertical (false)
  • len: Length of the word slot
  • jctIdxs: Indexes of all the junction cells that intersect other word slots
  • word: A word that fits the slot in length and matches the characters of intersecting word slots at its junctions (default “”)

Thus, we have the skeletal classes:

case class Slot(start: (Int, Int), horiz: Boolean, len: Int, jctIdxs: Set[Int], word: String = "")

case class Board private(sz: Int, bgCh: Char = '.', spCh: Char = '*', slots: Set[Slot] = Set()) {
  private val arr: Array[Array[Char]] = Array.ofDim(sz, sz)
}

Next, we expand class Board to include a few methods for recursively filling its word slots with matching words along with a couple of side-effecting methods for displaying board content, error reporting, etc.

case class Board private(sz: Int, bgCh: Char = '.', spCh: Char = '*', slots: Set[Slot] = Set()) {
  private val arr: Array[Array[Char]] = Array.ofDim(sz, sz)

  // Print content of a provided array
  def printArr(a: Array[Array[Char]] = arr): Unit

  // Report error messages
  def logError(msg: String, sList: List[Slot], wMap: Map[Int, Array[String]], jMap: Map[(Int, Int), Char]): Unit

  // Fill slots with words of matching lengths & junction characters
  def fillSlotsWithWords(words: Array[String]): Board

  // Update board array from slots
  def updateArrayFromSlots(): Board
}
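Method printArr is only declared in the skeleton above; a minimal standalone sketch of its logic, assuming the space-separated row format of the board printouts shown later in this post:

```scala
// Hypothetical standalone sketch of Board.printArr's formatting logic
object PrintArrSketch {
  // Render one board row as space-separated characters
  def formatRow(row: Array[Char]): String = row.mkString(" ")

  // Print each row of the board on its own line
  def printArr(a: Array[Array[Char]]): Unit =
    a.foreach(row => println(formatRow(row)))

  def main(args: Array[String]): Unit =
    printArr(Array(Array('.', '*', '.'), Array('*', '*', '*')))
}
```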

To separate initialization tasks from the core crossword-solving logic, we add a companion object to class Board.

object Board {
  // Initialize a board's array from provided background char, slot space char & slots
  def apply(sz: Int, bgCh: Char = '.', spCh: Char = '*', slots: Set[Slot] = Set()): Board

  // Convert an index of a slot into coordinates
  def idxToCoords(start: (Int, Int), horiz: Boolean, idx: Int): (Int, Int)

  // Initialize a board and create slots from board array with ONLY layout
  def initBoardLayout(layout: Array[Array[Char]], bg: Char = '.', sp: Char = '*'): Board
    def findSlot(i: Int, j: Int): Option[Slot]
    def createSlots(i: Int, j: Int, slots: Set[Slot], mkCh: Char): Set[Slot]
}

Creating a crossword puzzle

There are a couple of ways to construct a properly initialized Board object:

  1. From a set of pre-defined word slots, or,
  2. From a sz x sz array of the board layout with pre-populated background cells and empty word-slot cells
object Board {
  // Initialize a board's array from provided background char, slot space char & slots
  def apply(sz: Int, bgCh: Char = '.', spCh: Char = '*', slots: Set[Slot] = Set()): Board = {
    val board = new Board(sz, bgCh, spCh, slots)
    val slotCoords = slots.toList.flatMap{ slot =>
      val (i, j) = idxToCoords(slot.start, slot.horiz, slot.len - 1)
      List(i, j)
    }
    require(slots.isEmpty || slots.nonEmpty && slotCoords.max < sz, s"ERROR: $slots cannot be contained in ${board.arr}!")
    for (i <- 0 until sz; j <- 0 until sz) {
      board.arr(i)(j) = bgCh
    }
    slots.foreach{ slot =>
      val (i, j) = (slot.start._1, slot.start._2)
      (0 until slot.len).foreach{ k =>
        val (i, j) = idxToCoords(slot.start, slot.horiz, k)
        board.arr(i)(j) = if (slot.word.isEmpty) spCh else slot.word(k)
      }
    }
    board
  }

  ...
}

Method Board.apply() constructs a Board object by populating the private array field with the provided character bgCh to represent the background cells. Once initialized, empty slots represented by the other provided character spCh are “carved out” in accordance with a set of pre-defined Slot objects.
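The companion object's idxToCoords() used by Board.apply() is only declared in the skeleton. Consistent with the board examples in this post (e.g. the horizontal Slot((3, 1), true, 6, ...) occupies row 3, columns 1 through 6), a minimal sketch:

```scala
object IdxToCoordsSketch {
  // Walk `idx` cells from `start` along the slot's direction:
  // column-wise for a horizontal slot, row-wise for a vertical one
  def idxToCoords(start: (Int, Int), horiz: Boolean, idx: Int): (Int, Int) =
    if (horiz) (start._1, start._2 + idx)
    else (start._1 + idx, start._2)
}
```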

Example:

val emptySlots = Set(
  Slot((6, 8), false, 4, Set(2), ""),
  Slot((3, 1), true, 6, Set(2, 5), ""),
  Slot((3, 6), false, 7, Set(0, 5), ""),
  Slot((1, 9), false, 5, Set(0), ""),
  Slot((1, 3), false, 5, Set(0, 2, 4), ""),
  Slot((8, 1), true, 8, Set(5, 7), ""),
  Slot((1, 2), true, 8, Set(1, 7), ""),
  Slot((5, 0), true, 4, Set(3), "")
)

val board = Board(sz = 10, slots = emptySlots)

board.printArr()
/*
. . . . . . . . . . 
. . * * * * * * * * 
. . . * . . . . . * 
. * * * * * * . . * 
. . . * . . * . . * 
* * * * . . * . . * 
. . . . . . * . * . 
. . . . . . * . * . 
. * * * * * * * * . 
. . . . . . * . * . 
*/

Alternatively, one could construct a Board by providing a pre-populated array of the board layout.

object Board {
  ...

  // Initialize a board and create slots from board array with ONLY layout
  def initBoardLayout(layout: Array[Array[Char]], bg: Char = '.', sp: Char = '*'): Board = {
    val sz = layout.length
    def findSlot(i: Int, j: Int): Option[Slot] = {
      if (j < sz-1 && layout(i)(j+1) == sp) {  // Horizontal
        val ln = Iterator.from(j).takeWhile(k => k < sz && layout(i)(k) == sp).size
        val js = (0 until ln).collect{
          case k if (i>=0+1 && layout(i-1)(j+k)!=bg) || (i<sz-1 && layout(i+1)(j+k)!=bg) => k
        }.toSet
        Option.when(ln > 1)(Slot((i, j), true, ln, js))
      }
      else {  // Vertical
        val ln = Iterator.from(i).takeWhile(k => k < sz && layout(k)(j) == sp).size
        val js = (0 until ln).collect{
          case k if (j>=0+1 && layout(i+k)(j-1)!=bg) || (j<sz-1 && layout(i+k)(j+1)!=bg) => k
        }.toSet
        Option.when(ln > 1)(Slot((i, j), false, ln, js))
      }
    }
    def createSlots(i: Int, j: Int, slots: Set[Slot], mkCh: Char): Set[Slot] = {
      if (i == sz) slots
      else {
        if (j == sz)
          createSlots(i+1, 0, slots, mkCh)
        else {
          findSlot(i, j) match {
            case Some(slot) =>
              val jctCoords = slot.jctIdxs.map{ idx =>
                Board.idxToCoords(slot.start, slot.horiz, idx)
              }
              (0 until slot.len).foreach { k =>
                val (i, j) = Board.idxToCoords(slot.start, slot.horiz, k)
                if (!jctCoords.contains((i, j)))
                  layout(i)(j) = mkCh
              }
              createSlots(i, j+1, slots + slot, mkCh)
            case None =>
              createSlots(i, j+1, slots, mkCh)
          }
        }
      }
    }
    val mkCh = '\u0000'
    val newBoard = Board(sz, bg, sp).copy(slots = createSlots(0, 0, Set(), mkCh))
    (0 until sz).foreach{ i =>
      newBoard.arr(i) = layout(i).map{ ch => if (ch == mkCh) sp else ch }
    }
    newBoard
  }
}

Method Board.initBoardLayout() kicks off createSlots(), which parses every element of the array to identify the “head” of each word slot, then figures out its “tail” and the indexes of the cells that intersect other slots. The mkCh character temporarily masks off any identified slot cell that is not an intersecting junction during the parsing.

Example:

val arr = Array(
  Array('.', '.', '.', '.', '.', '.', '.', '.', '.', '.'),
  Array('.', '.', '*', '*', '*', '*', '*', '*', '*', '*'),
  Array('.', '.', '.', '*', '.', '.', '.', '.', '.', '*'),
  Array('.', '*', '*', '*', '*', '*', '*', '.', '.', '*'),
  Array('.', '.', '.', '*', '.', '.', '*', '.', '.', '*'),
  Array('*', '*', '*', '*', '.', '.', '*', '.', '.', '*'),
  Array('.', '.', '.', '.', '.', '.', '*', '.', '*', '.'),
  Array('.', '.', '.', '.', '.', '.', '*', '.', '*', '.'),
  Array('.', '*', '*', '*', '*', '*', '*', '*', '*', '.'),
  Array('.', '.', '.', '.', '.', '.', '*', '.', '*', '.')
)

val board = Board.initBoardLayout(arr)
/*
Board(10, '.', '*', HashSet(
  Slot((6, 8), false, 4, Set(2), ""),
  Slot((3, 1), true, 6, Set(2, 5), ""),
  Slot((3, 6), false, 7, Set(0, 5), ""),
  Slot((1, 9), false, 5, Set(0), ""),
  Slot((1, 3), false, 5, Set(0, 2, 4), ""),
  Slot((8, 1), true, 8, Set(5, 7), ""),
  Slot((1, 2), true, 8, Set(1, 7), ""),
  Slot((5, 0), true, 4, Set(3), "")
))
*/

board.printArr()
/*
. . . . . . . . . . 
. . * * * * * * * * 
. . . * . . . . . * 
. * * * * * * . . * 
. . . * . . * . . * 
* * * * . . * . . * 
. . . . . . * . * . 
. . . . . . * . * . 
. * * * * * * * * . 
. . . . . . * . * . 
*/

Solving the crossword puzzle

The core crossword-solving logic is handled by the method fillSlotsWithWords(), defined within the body of the case class Board.

case class Board private(sz: Int, bgCh: Char = '.', spCh: Char = '*', slots: Set[Slot] = Set()) {
  private val arr: Array[Array[Char]] = Array.ofDim(sz, sz)

  ...

  // Fill slots with words of matching lengths & junction characters
  def fillSlotsWithWords(words: Array[String]): Board = {
    val wordMap: Map[Int, Array[String]] = words.groupMap(_.length)(identity)
    val wLenProd: Int = wordMap.map{ case (_, ws) => ws.size }.product
    val trials: Int = wLenProd * wLenProd
    // Prioritize slots with `len` of the smallest by-len group and maximal number of `jctIdxs`
    val orderedSlots = slots.groupMap(_.len)(identity).toList.
      flatMap{ case (_, ss) => ss.map((ss.size, _)) }.
      sortBy{ case (k, slot) => (k, -slot.jctIdxs.size) }.
      map{ case (_, slot) => slot }
    val jctMap: Map[(Int, Int), Char] = slots.
      map(slot => slot.jctIdxs.map(Board.idxToCoords(slot.start, slot.horiz, _))).
      reduce(_ union _).map(_->' ').toMap
    def loop(slotList: List[Slot],
             slotSet: Set[Slot],
             wMap: Map[Int, Array[String]],
             jMap: Map[(Int, Int), Char],
             runs: Int): Set[Slot] = {
      if (runs == 0) {  // Done trying!
        logError(s"FAILURE: Tried $trials times ...", slotList, wMap, jMap)
        slotSet
      }
      else {
        slotList match {
          case Nil =>
            slotSet  // Success!
          case slot :: others =>
            val wordsWithLen = wMap.get(slot.len)
            wordsWithLen match {
              case None =>
                logError(s"FAILURE: Missing words of length ${slot.len}!", slotList, wMap, jMap)
                slotSet
              case Some(words) =>
                words.find{ word =>
                  slot.jctIdxs.forall{ idx =>
                    val jCoords = Board.idxToCoords(slot.start, slot.horiz, idx)
                    jMap(jCoords) == ' ' || word(idx) == jMap(jCoords)
                  }
                }
                match {
                  case Some(w) =>
                    val kvs = slot.jctIdxs.map { idx =>
                      Board.idxToCoords(slot.start, slot.horiz, idx) -> w(idx)
                    }
                    loop(others, slotSet - slot + slot.copy(word = w), wMap, jMap ++ kvs, runs-1)
                  case None =>
                    val newWMap = wMap + (slot.len -> (words.tail :+ words.head))
                    // Restart the loop with altered wordMap
                    loop(orderedSlots, slots, newWMap, jctMap, runs-1)
                }
            }
        }
      }
    }
    val newBoard = copy(slots = loop(orderedSlots, Set(), wordMap, jctMap, wLenProd * wLenProd))
    (0 until sz).foreach{ i => newBoard.arr(i) = arr(i) }
    newBoard.updateArrayFromSlots()
  }

  ...
}

Method fillSlotsWithWords() takes an array of words as its parameter; those words are supposed to fit the slots of a given initialized Board. The method creates an optimally ordered List of the slots and a Map of the words grouped by length, with each key associated with the words of that length. In addition, it assembles a Map of slot junctions as a helper dataset for matching intersecting characters by their XY-coordinates.

The method then uses a recursive loop() to fill the slotList, shuffling the wordMap as needed. Let me elaborate a bit:

  • Optimally ordered slots – The slots are pre-ordered so that those with the rarest slot lengths and the most slot junctions are processed first. The idea is to fill upfront the slots that can be placed with the most certainty (e.g. those of a unique length) and lock down as many junction characters as possible as matching criteria for the intersecting slots.
  • Order shuffling – During the recursive processing, when a slot of a given length can no longer be filled with any of the remaining words of that length, the word list for that length is rotated and loop() is restarted with the altered wordMap.
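To see the prioritization heuristic in isolation, here is a small self-contained sketch applying the same groupMap/sortBy pipeline to a simplified slot type (MiniSlot is a made-up stand-in carrying only the two fields the ordering uses):

```scala
// MiniSlot is a hypothetical stand-in for Slot, keeping only the slot
// length and the number of junctions, which are all the ordering needs.
case class MiniSlot(len: Int, jctCount: Int)

val miniSlots = Set(MiniSlot(5, 1), MiniSlot(5, 3), MiniSlot(7, 2), MiniSlot(4, 0))

// Group slots by length, tag each with its group size, then sort by
// (group size ascending, junction count descending): slots of the rarest
// lengths and with the most intersections are processed first.
val ordered = miniSlots.groupMap(_.len)(identity).toList.
  flatMap{ case (_, ss) => ss.map((ss.size, _)) }.
  sortBy{ case (k, slot) => (k, -slot.jctCount) }.
  map{ case (_, slot) => slot }

println(ordered)
// List(MiniSlot(7,2), MiniSlot(4,0), MiniSlot(5,3), MiniSlot(5,1))
```

The two unique-length slots come first (their by-length groups have size 1), and within the shared length 5 the slot with more junctions wins.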

Finally, since the given crossword slot layout and word list might not have a solution at all, a limit on the number of runs of the recursive loop() is imposed as a safety measure against an infinite loop.

Example:

val words = Array("mandarin", "apple", "nance", "papaya", "avocado", "date", "kiwi", "honeydew")

val filledBoard = board.fillSlotsWithWords(words)
/*
Board(10, '.', '*', HashSet(
  Slot((1, 9), false, 5, Set(0), "nance"),
  Slot((3, 6), false, 7, Set(0, 5), "avocado"),
  Slot((5, 0), true, 4, Set(3), "date"),
  Slot((8, 1), true, 8, Set(5, 7), "honeydew"),
  Slot((6, 8), false, 4, Set(2), "kiwi"),
  Slot((3, 1), true, 6, Set(2, 5), "papaya"),
  Slot((1, 3), false, 5, Set(0, 2, 4), "apple"),
  Slot((1, 2), true, 8, Set(1, 7), "mandarin")
))
*/

filledBoard.printArr()
/*
. . . . . . . . . . 
. . m a n d a r i n 
. . . p . . . . . a 
. p a p a y a . . n 
. . . l . . v . . c 
d a t e . . o . . e 
. . . . . . c . k . 
. . . . . . a . i . 
. h o n e y d e w . 
. . . . . . o . i . 
*/

Final thoughts

Appended below is the complete source code of the classes for the crossword puzzle.

Obviously there are many different ways to formulate and solve a puzzle game like this. What's illustrated here is essentially a brute-force approach, since the order-shuffling routine has to reset the recursion whenever the slot-filling loop hits a wall. The good news is that, tested against a dozen random examples (admittedly a small sample), the prioritization strategy of optimally picking the slots to be evaluated does prove to help a great deal in solving the sample games efficiently.

case class Slot(start: (Int, Int), horiz: Boolean, len: Int, jctIdxs: Set[Int], word: String = "")

object Board {
  // Initialize a board's array from provided background char, slot space char & slots
  def apply(sz: Int, bgCh: Char = '.', spCh: Char = '*', slots: Set[Slot] = Set()): Board = {
    val board = new Board(sz, bgCh, spCh, slots)
    val slotCoords = slots.toList.flatMap{ slot =>
      val (i, j) = idxToCoords(slot.start, slot.horiz, slot.len - 1)
      List(i, j)
    }
    require(slots.isEmpty || slotCoords.max < sz, s"ERROR: $slots cannot be contained in a $sz x $sz board!")
    for (i <- 0 until sz; j <- 0 until sz) {
      board.arr(i)(j) = bgCh
    }
    slots.foreach{ slot =>
      (0 until slot.len).foreach{ k =>
        val (i, j) = idxToCoords(slot.start, slot.horiz, k)
        board.arr(i)(j) = if (slot.word.isEmpty) spCh else slot.word(k)
      }
    }
    board
  }

  // Convert an index of a slot into coordinates
  def idxToCoords(start: (Int, Int), horiz: Boolean, idx: Int): (Int, Int) = {
    if (horiz) (start._1, start._2 + idx) else (start._1 + idx, start._2)
  }

  // Initialize a board and create slots from board array with ONLY layout
  def initBoardLayout(layout: Array[Array[Char]], bg: Char = '.', sp: Char = '*'): Board = {
    val sz = layout.length
    def findSlot(i: Int, j: Int): Option[Slot] = {
      if (j < sz-1 && layout(i)(j+1) == sp) {  // Horizontal
        val ln = Iterator.from(j).takeWhile(k => k < sz && layout(i)(k) == sp).size
        val js = (0 until ln).collect{
          case k if (i>=0+1 && layout(i-1)(j+k)!=bg) || (i<sz-1 && layout(i+1)(j+k)!=bg) => k
        }.toSet
        Option.when(ln > 1)(Slot((i, j), true, ln, js))
      }
      else {  // Vertical
        val ln = Iterator.from(i).takeWhile(k => k < sz && layout(k)(j) == sp).size
        val js = (0 until ln).collect{
          case k if (j>=0+1 && layout(i+k)(j-1)!=bg) || (j<sz-1 && layout(i+k)(j+1)!=bg) => k
        }.toSet
        Option.when(ln > 1)(Slot((i, j), false, ln, js))
      }
    }
    def createSlots(i: Int, j: Int, slots: Set[Slot], mkCh: Char): Set[Slot] = {
      if (i == sz) slots
      else {
        if (j == sz)
          createSlots(i+1, 0, slots, mkCh)
        else {
          findSlot(i, j) match {
            case Some(slot) =>
              val jctCoords = slot.jctIdxs.map{ idx =>
                Board.idxToCoords(slot.start, slot.horiz, idx)
              }
              (0 until slot.len).foreach { k =>
                val (i, j) = Board.idxToCoords(slot.start, slot.horiz, k)
                if (!jctCoords.contains((i, j)))
                  layout(i)(j) = mkCh
              }
              createSlots(i, j+1, slots + slot, mkCh)
            case None =>
              createSlots(i, j+1, slots, mkCh)
          }
        }
      }
    }
    val mkCh = '\u0000'
    val newBoard = Board(sz, bg, sp).copy(slots = createSlots(0, 0, Set(), mkCh))
    (0 until sz).foreach{ i =>
      newBoard.arr(i) = layout(i).map{ ch => if (ch == mkCh) sp else ch }
    }
    newBoard
  }
}

case class Board private(sz: Int, bgCh: Char = '.', spCh: Char = '*', slots: Set[Slot] = Set()) {
  private val arr: Array[Array[Char]] = Array.ofDim(sz, sz)

  // Print content of a provided array
  def printArr(a: Array[Array[Char]] = arr): Unit =
    for (i <- 0 until sz) {
      for (j <- 0 until sz) {
        print(s"${a(i)(j)} ")
        if (a(i)(j) == '\u0000') print(" ")  // Print a filler space for null char
      }
      println()
    }

  // Report error messages
  def logError(msg: String, sList: List[Slot], wMap: Map[Int, Array[String]], jMap: Map[(Int, Int), Char]): Unit = {
    println(msg)
    println(s"Remaining slots: $sList")
    println(s"Latest wordMap: $wMap")
    println(s"Latest jctMap: $jMap")
  }

  // Fill slots with words of matching lengths & junction characters
  def fillSlotsWithWords(words: Array[String]): Board = {
    val wordMap: Map[Int, Array[String]] = words.groupMap(_.length)(identity)
    val wLenProd: Int = wordMap.map{ case (_, ws) => ws.size }.product
    val trials: Int = wLenProd * wLenProd
    // Prioritize slots with `len` of the smallest by-len group and maximal number of `jctIdxs`
    val orderedSlots = slots.groupMap(_.len)(identity).toList.
      flatMap{ case (_, ss) => ss.map((ss.size, _)) }.
      sortBy{ case (k, slot) => (k, -slot.jctIdxs.size) }.
      map{ case (_, slot) => slot }
    val jctMap: Map[(Int, Int), Char] = slots.toList.
      flatMap(slot => slot.jctIdxs.map(Board.idxToCoords(slot.start, slot.horiz, _))).
      map(_ -> ' ').toMap  // flatMap/toMap avoids reduce() crashing on an empty slot set
    def loop(slotList: List[Slot],
             slotSet: Set[Slot],
             wMap: Map[Int, Array[String]],
             jMap: Map[(Int, Int), Char],
             runs: Int): Set[Slot] = {
      if (runs == 0) {  // Done trying!
        logError(s"FAILURE: Tried $trials times ...", slotList, wMap, jMap)
        slotSet
      }
      else {
        slotList match {
          case Nil =>
            slotSet  // Success!
          case slot :: others =>
            val wordsWithLen = wMap.get(slot.len)
            wordsWithLen match {
              case None =>
                logError(s"FAILURE: Missing words of length ${slot.len}!", slotList, wMap, jMap)
                slotSet
              case Some(words) =>
                words.find{ word =>
                  slot.jctIdxs.forall{ idx =>
                    val jCoords = Board.idxToCoords(slot.start, slot.horiz, idx)
                    jMap(jCoords) == ' ' || word(idx) == jMap(jCoords)
                  }
                }
                match {
                  case Some(w) =>
                    val kvs = slot.jctIdxs.map { idx =>
                      Board.idxToCoords(slot.start, slot.horiz, idx) -> w(idx)
                    }
                    loop(others, slotSet - slot + slot.copy(word = w), wMap, jMap ++ kvs, runs-1)
                  case None =>
                    val newWMap = wMap + (slot.len -> (words.tail :+ words.head))
                    // Restart the loop with altered wordMap
                    loop(orderedSlots, slots, newWMap, jctMap, runs-1)
                }
            }
        }
      }
    }
    val newBoard = copy(slots = loop(orderedSlots, Set(), wordMap, jctMap, wLenProd * wLenProd))
    (0 until sz).foreach{ i => newBoard.arr(i) = arr(i) }
    newBoard.updateArrayFromSlots()
  }

  // Update board array from slots
  def updateArrayFromSlots(): Board = {
    slots.foreach{ slot =>
      (0 until slot.len).foreach{ k =>
        val (i, j) = Board.idxToCoords(slot.start, slot.horiz, k)
        arr(i)(j) = if (slot.word.isEmpty) spCh else slot.word(k)
      }
    }
    this
  }
}


Scala Binary Search Tree

When I wrote about Scala linked list implementation a couple of years ago, I also did some quick groundwork for implementing binary search trees (BST). Occupied by other R&D projects at the time, I put it aside, and it has since been patiently awaiting its turn to see the light of day. Since much of the code is already there, I'm going to put it up in this blog post along with some narrative remarks.

First, we come up with an ADT (algebraic data type). Let's call it BSTree. It starts out with a base trait with generic type A for the data elements to be stored inside the tree structure, extended by a case class BSBranch for tree branches and a case object BSLeaf for the "null" tree nodes. The ADT's overall structure resembles the one used in the linked list implementation described in the old post.

ADT BSTree
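A minimal sketch of what such an ADT could look like; the field names (elem, count, left, right) are assumptions, not necessarily the original’s:

```scala
// Sketch of the BSTree ADT; field names are assumptions.
sealed trait BSTree[+A] {
  // simplified string output: "." for a leaf, "(left value:count right)" for a branch
  override def toString: String = this match {
    case BSLeaf               => "."
    case BSBranch(v, c, l, r) => s"($l $v:$c $r)"
  }
}
// Data elements live only in branches; duplicate values bump `count`.
case class BSBranch[+A](
  elem:  A,
  count: Int       = 1,
  left:  BSTree[A] = BSLeaf,
  right: BSTree[A] = BSLeaf
) extends BSTree[A]
// Covariance lets the single BSLeaf object serve as the "null" node of any BSTree[A].
case object BSLeaf extends BSTree[Nothing]
```

Note that because the base trait supplies a concrete toString, the case classes do not generate their own synthetic one.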

A few notes:

  • Data elements (of generic type A) are stored only in branches, and data of the same value go to the same branch, represented with a proper count value.
  • BSTree is covariant; otherwise BSLeaf couldn’t even be defined as a subtype of BSTree[Nothing].
  • A toString method is created for simplified string output of a tree instance.

Populating a BSTree

One of the first things we need is a method to insert tree nodes into an existing BSTree. We start expanding the base trait with method insert(). That’s all great for adding a node one at a time, but we also need a way to create a BSTree and populate it from a readily available collection of data elements. It makes sense to delegate such a factory method to the companion object BSTree as its method apply().

Note that type parameter B for insert() needs to be a supertype of A because Function1 is contravariant over its parameter type. In addition, the context bound “B : Ordering” constrains type B to be capable of being ordered (i.e. compared) which is necessary for traversing a binary search tree.

Testing BSTree.apply():
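A sketch of insert() and the companion object’s apply() under the assumed ADT shape, followed by the test; the exact original signatures may differ:

```scala
sealed trait BSTree[+A] {
  // B >: A because Function1 is contravariant over its parameter type;
  // the context bound B : Ordering allows comparisons while traversing.
  def insert[B >: A : Ordering](x: B): BSTree[B] = this match {
    case BSLeaf => BSBranch(x)
    case BSBranch(v, c, l, r) =>
      val ord = implicitly[Ordering[B]]
      if (ord.lt(x, v))      BSBranch(v, c, l.insert(x), r)
      else if (ord.gt(x, v)) BSBranch(v, c, l, r.insert(x))
      else                   BSBranch(v, c + 1, l, r)  // same value: bump count
  }
  override def toString: String = this match {
    case BSLeaf               => "."
    case BSBranch(v, c, l, r) => s"($l $v:$c $r)"
  }
}
case class BSBranch[+A](elem: A, count: Int = 1,
    left: BSTree[A] = BSLeaf, right: BSTree[A] = BSLeaf) extends BSTree[A]
case object BSLeaf extends BSTree[Nothing]

// Factory method: populate a BSTree from a ready collection of elements.
object BSTree {
  def apply[A : Ordering](xs: A*): BSTree[A] =
    xs.foldLeft(BSLeaf: BSTree[A])(_ insert _)
}

// Testing BSTree.apply():
val tree = BSTree(5, 3, 8, 3)
// tree.toString == "((. 3:2 .) 5:1 (. 8:1 .))"
```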

Tree traversal and finding tree nodes

Next, we need methods for tree traversal and search. For brevity, we only include in-order traversal.

Using the tree created above:
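A sketch of in-order traversal and node lookup under the same assumed ADT shape (method names inOrder and find are assumptions):

```scala
sealed trait BSTree[+A] {
  def insert[B >: A : Ordering](x: B): BSTree[B] = this match {
    case BSLeaf => BSBranch(x)
    case BSBranch(v, c, l, r) =>
      val ord = implicitly[Ordering[B]]
      if (ord.lt(x, v))      BSBranch(v, c, l.insert(x), r)
      else if (ord.gt(x, v)) BSBranch(v, c, l, r.insert(x))
      else                   BSBranch(v, c + 1, l, r)
  }
  // in-order traversal: left subtree, node value (count times), right subtree
  def inOrder: List[A] = this match {
    case BSLeaf               => Nil
    case BSBranch(v, c, l, r) => l.inOrder ::: List.fill(c)(v) ::: r.inOrder
  }
  // standard BST lookup, returning the matching subtree if found
  def find[B >: A : Ordering](x: B): Option[BSTree[B]] = this match {
    case BSLeaf => None
    case b @ BSBranch(v, _, l, r) =>
      val ord = implicitly[Ordering[B]]
      if (ord.lt(x, v))      l.find(x)
      else if (ord.gt(x, v)) r.find(x)
      else                   Some(b)
  }
}
case class BSBranch[+A](elem: A, count: Int = 1,
    left: BSTree[A] = BSLeaf, right: BSTree[A] = BSLeaf) extends BSTree[A]
case object BSLeaf extends BSTree[Nothing]

val tree = List(5, 3, 8, 3).foldLeft(BSLeaf: BSTree[Int])(_ insert _)
// tree.inOrder == List(3, 3, 5, 8); duplicates expanded per their counts
```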

Removing tree nodes

To be able to remove tree nodes holding a specific element value or a range of values, we also include the following few methods in the base trait.

Note that delete() may involve a little shuffling of the tree nodes. Once the tree node to be removed is located, it may need to be replaced by the node holding the next-bigger element and count values from its right subtree (or, equivalently, the node holding the next-smaller element from its left subtree).

Method trim() removes tree nodes with element values below or above the provided range. Meanwhile, method cutOut() does the opposite by cutting out tree nodes with values within the given range. It involves slightly more work than trim(), requiring the use of delete() for individual tree nodes.

Example:
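A reconstruction sketch of delete() (trim() and cutOut() omitted for brevity); the helper minEntry is an assumption introduced here to locate the next-bigger replacement node:

```scala
sealed trait BSTree[+A] {
  def inOrder: List[A] = this match {
    case BSLeaf               => Nil
    case BSBranch(v, c, l, r) => l.inOrder ::: List.fill(c)(v) ::: r.inOrder
  }
  // smallest (element, count) pair in the tree, if any
  def minEntry: Option[(A, Int)] = this match {
    case BSLeaf                    => None
    case BSBranch(v, c, BSLeaf, _) => Some((v, c))
    case BSBranch(_, _, l, _)      => l.minEntry
  }
  // remove the node holding x (regardless of its count)
  def delete[B >: A : Ordering](x: B): BSTree[B] = {
    val ord = implicitly[Ordering[B]]
    this match {
      case BSLeaf => BSLeaf
      case BSBranch(v, c, l, r) =>
        if (ord.lt(x, v))      BSBranch(v, c, l.delete(x), r)
        else if (ord.gt(x, v)) BSBranch(v, c, l, r.delete(x))
        else (l, r) match {
          case (BSLeaf, _) => r
          case (_, BSLeaf) => l
          case _ =>
            // fill the hole with the next-bigger element & count from the right subtree
            val (nv, nc) = r.minEntry.get  // right subtree is non-empty here
            BSBranch(nv, nc, l, r.delete[B](nv))
        }
    }
  }
}
case class BSBranch[+A](elem: A, count: Int = 1,
    left: BSTree[A] = BSLeaf, right: BSTree[A] = BSLeaf) extends BSTree[A]
case object BSLeaf extends BSTree[Nothing]

val tree = BSBranch(5, 1, BSBranch(3), BSBranch(8, 1, BSBranch(7), BSBranch(9)))
// tree.delete(5).inOrder == List(3, 7, 8, 9)
```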

Rebuilding a binary search tree

A highly unbalanced binary search tree defeats the purpose of using such a data structure. One of the most straightforward ways to rebuild a binary search tree is to “unpack” the individual tree nodes of the existing tree by traversing it in-order into a sequence (e.g. a Vector or List) of elements, followed by reconstructing a new tree whose nodes are assigned elements by recursively halving the in-order node list.

Example:
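A sketch of the rebuild (the method names entries and rebalance are assumptions): unpack in-order into a Vector of (element, count) pairs, then build by recursively halving the index range:

```scala
sealed trait BSTree[+A] {
  def height: Int = this match {
    case BSLeaf               => 0
    case BSBranch(_, _, l, r) => 1 + (l.height max r.height)
  }
  // "unpack" the tree in-order into (element, count) entries
  def entries: Vector[(A, Int)] = this match {
    case BSLeaf               => Vector.empty
    case BSBranch(v, c, l, r) => (l.entries :+ (v -> c)) ++ r.entries
  }
  // rebuild a balanced tree by recursively halving the in-order entry list
  def rebalance: BSTree[A] = {
    val es = entries
    def build(lo: Int, hi: Int): BSTree[A] =
      if (lo > hi) BSLeaf
      else {
        val mid    = (lo + hi) / 2
        val (v, c) = es(mid)
        BSBranch(v, c, build(lo, mid - 1), build(mid + 1, hi))
      }
    build(0, es.length - 1)
  }
}
case class BSBranch[+A](elem: A, count: Int = 1,
    left: BSTree[A] = BSLeaf, right: BSTree[A] = BSLeaf) extends BSTree[A]
case object BSLeaf extends BSTree[Nothing]

// a fully right-skewed (degenerate) tree, as if built from ascending inserts
val skewed = (1 to 7).foldRight(BSLeaf: BSTree[Int])((i, t) => BSBranch(i, 1, BSLeaf, t))
// skewed.height == 7; skewed.rebalance.height == 3, same in-order entries
```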

Thoughts on the ADT

An alternative ADT design is to declare the class fields and methods in the BSTree base trait and let the specific implementations reside within the subclasses BSBranch and BSLeaf, thus eliminating the boilerplate pattern matching over the subclasses. There is also the benefit of making class fields like left and right referenceable from the base trait, though they would need to be wrapped in Options, with value None for BSLeaf.

As can be seen with the existing ADT, an advantage is having all the binary tree functions defined within the base trait once and for all. If there is the need for having left and right referenceable from the BSTree base trait, one can define something like below within the trait.

Example:
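A sketch of such Option-wrapped accessors in the base trait (the case class fields are renamed l/r here to avoid clashing with the trait’s left/right; the original naming may differ):

```scala
sealed trait BSTree[+A] {
  // left/right referenceable from the base trait, wrapped in Options
  def left: Option[BSTree[A]] = this match {
    case BSBranch(_, _, l, _) => Some(l)
    case BSLeaf               => None
  }
  def right: Option[BSTree[A]] = this match {
    case BSBranch(_, _, _, r) => Some(r)
    case BSLeaf               => None
  }
}
case class BSBranch[+A](elem: A, count: Int = 1,
    l: BSTree[A] = BSLeaf, r: BSTree[A] = BSLeaf) extends BSTree[A]
case object BSLeaf extends BSTree[Nothing]
```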

Then there is also the non-idiomatic approach of using mutable class fields in a single tree class, commonly seen in Java implementations, like below:
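A hypothetical sketch of that Java-style mutable variant (class and member names are assumptions):

```scala
import scala.collection.mutable.ListBuffer

// Non-idiomatic, Java-style BST: a single class with mutable node fields.
class MutableBST[A](implicit ord: Ordering[A]) {
  private class Node(val elem: A) {
    var count: Int  = 1
    var left:  Node = null
    var right: Node = null
  }
  private var root: Node = null

  def insert(x: A): Unit =
    if (root == null) root = new Node(x)
    else {
      var cur  = root
      var done = false
      while (!done) {
        if (ord.lt(x, cur.elem)) {
          if (cur.left == null) { cur.left = new Node(x); done = true }
          else cur = cur.left
        } else if (ord.gt(x, cur.elem)) {
          if (cur.right == null) { cur.right = new Node(x); done = true }
          else cur = cur.right
        } else { cur.count += 1; done = true }
      }
    }

  def inOrder: List[A] = {
    val buf = ListBuffer.empty[A]
    def go(n: Node): Unit =
      if (n != null) {
        go(n.left)
        (1 to n.count).foreach(_ => buf += n.elem)
        go(n.right)
      }
    go(root)
    buf.toList
  }
}
```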

Addendum: Complete source code of the BSTree ADT



Trampolining with Scala TailCalls

It’s pretty safe to assert that all software engineers must have come across stack overflow problems over the course of their career. For a computational task that involves recursions with complex programming logic, keeping the stack frames in a controlled manner could be challenging.
可以肯定地说,所有软件工程师在他们的职业生涯中一定遇到过堆栈溢出问题。对于涉及具有复杂编程逻辑的递归的计算任务,以受控方式保持堆栈帧可能具有挑战性。

What is trampolining?

As outlined in the Wikipedia entry on trampolines, the term means different things in different programming paradigms. In the Scala world, trampolining is a means of making a function with recursive programming logic stack-safe: the recursive logic flow is reformulated into tail calls among functional components wrapped in objects, so that the computation runs in JVM heap memory instead of on the stack.

A classic example for illustrating stack overflow issues is the good old Factorial.
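For reference, a plain (non-tail) recursive factorial; each invocation must hold its stack frame until the recursive call returns, so a large enough n (say, tens of thousands, depending on the JVM stack size) ends in a StackOverflowError:

```scala
// Plain recursion: n * factorial(n - 1) waits on the recursive call,
// so every invocation keeps a stack frame until the base case returns.
def factorial(n: Int): BigInt =
  if (n == 0) BigInt(1)
  else n * factorial(n - 1)

factorial(5)  // 120
// factorial(100000) would likely throw StackOverflowError with a default-sized stack
```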

Tail call versus tail recursion

It should be noted that a tail call can, in a way, be viewed as a generalization of tail recursion. A tail-recursive function is a tail-calling function that calls only itself, whereas a general tail-calling function may perform tail calls of other functions. Tail-recursive functions are not only stack-safe; they can often operate with an efficient O(n) time complexity and O(1) space complexity.

On the other hand, trampolining a function provides a stack-safe solution but does not speed up computation. On the contrary, it often takes longer than the stack-based method to produce a result (until, that is, the stack overflow problem surfaces). That’s largely due to the necessary “bounces” between functions and the object-wrapping involved.

Factorial with Scala TailCalls

Scala’s standard library provides TailCalls for writing stack-safe recursive functions via trampolining.

The stack-safe factorial function using Scala TailCalls will look like below:
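A sketch of that function (the name trampFactorial follows the narrative; the original code embed may differ in detail):

```scala
import scala.util.control.TailCalls._

// Stack-safe factorial: the recursion is suspended in TailRec objects
// that get evaluated on the heap when result is called.
def trampFactorial(n: Int): TailRec[BigInt] =
  if (n == 0) done(BigInt(1))
  else tailcall(trampFactorial(n - 1)).map(n * _)

trampFactorial(10).result  // 3628800
// trampFactorial(10000).result completes without blowing the stack
```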

Scala TailCalls under the hood

To understand how things work under the hood, let’s extract the skeletal components from object TailCalls‘ source code.
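The following is a simplified, self-contained approximation of those skeletal components (renamed MiniTailCalls to avoid clashing with the standard library; the real implementation differs in details such as access modifiers and type plumbing):

```scala
object MiniTailCalls {
  sealed abstract class TailRec[+A] {
    // flatMap always returns a tail-calling Call or Cont: itself a "trampoline"
    def flatMap[B](f: A => TailRec[B]): TailRec[B] = this match {
      case Done(a)      => Call(() => f(a))
      case c @ Call(_)  => Cont(c, (x: Any) => f(x.asInstanceOf[A]))
      case Cont(sub, k) => Cont(sub, (x: Any) => k(x).flatMap(f))
    }
    // map is the special case flatMap(a => Call(() => Done(f(a))))
    def map[B](f: A => B): TailRec[B] = flatMap(a => Call(() => Done(f(a))))
    // the evaluator: a tail-recursive loop matching against the subclasses
    @annotation.tailrec final def result: A = this match {
      case Done(a)      => a
      case Call(rest)   => rest().result
      case Cont(sub, k) => sub match {
        case Done(v)        => k(v).result
        case Call(rest)     => rest().flatMap(k).result
        case Cont(sub2, k2) => sub2.flatMap((x: Any) => k2(x).flatMap(k)).result
      }
    }
  }
  // literal return values
  final case class Done[+A](value: A) extends TailRec[A]
  // suspended recursive calls
  final case class Call[+A](rest: () => TailRec[A]) extends TailRec[A]
  // continual transformations of a sub-computation
  final case class Cont[+A](sub: TailRec[Any], k: Any => TailRec[A]) extends TailRec[A]

  def done[A](value: A): TailRec[A]                = Done(value)
  def tailcall[A](rest: => TailRec[A]): TailRec[A] = Call(() => rest)
}

// usage: the same trampolined factorial, now on the mini trampoline
import MiniTailCalls._
def fac(n: Int): TailRec[BigInt] =
  if (n == 0) done(BigInt(1)) else tailcall(fac(n - 1)).map(n * _)
```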

Object TailCalls contains the base class TailRec, which provides the transformation methods flatMap and map, along with a tail-recursive method result that performs the actual evaluation. Also included are subclasses Done, Call, and Cont, which encapsulate expressions that represent, respectively, the following:

  • literal return values
  • recursive calls
  • continual transformations

In addition, methods tailcall() and done() are exposed to the users so they don’t need to directly meddle with the subclasses.

In general, to enable trampolining for a given function, make the function’s return value of type TailRec, then restructure any literal function returns, recursive calls, and continual transformations within the function body into the corresponding subclasses of TailRec, all in a tail-calling fashion.

Methods done() and tailcall()

As can be seen from the TailCalls source code, all the subclasses (i.e. Done, Call, Cont) are shielded by the protected access modifier. Users are expected to use the interfacing methods done() and tailcall(), along with the class methods map and flatMap, to formulate a tail-call version of the target function.

Method done(value) is just equivalent to Done(value), whereas tailcall(r: => TailRec) represents Call(() => r). It should be noted that the by-name parameter of tailcall() is critical to ensuring laziness, correlating to the Function0 parameter of class Call(). It’s an integral part of the stack-safe mechanics of trampolining.

Methods map and flatMap

Back to the factorial function. In the n == 0 case, the return value is a constant, so we wrap it in a Done(). The remaining case involves continual operations, so we wrap it in a Call(). Obviously, we can’t simply do tailcall(n * trampFactorial(n-1)) since trampFactorial() now returns an object. Rather, we transform via map with a function t => n * t, similar to how we transform the internal value of an Option or Future.

But then tailcall(trampFactorial(n-1)).map(n * _) doesn’t look like a tail call. Why is it able to accomplish stack safety? To find out why, let’s look at how map and flatMap are implemented in the Scala TailCalls source code.

Map and flatMap are “trampolines”

From the source code, one can see the implementation of its class method flatMap follows the same underlying principles of trampolining — regardless of which subclass the current TailRec object belongs to, the method returns a tail-calling Call() or Cont(). That makes flatMap itself a trampolining transformation.

As for method map, it’s implemented as the special case of flatMap transforming a => Call(() => Done(f(a))), which is a trampolining tail call as well. Thus, both map and flatMap are trampolining transformations. Consequently, a tail expression consisting of an arbitrary sequence of transformations with the two methods will preserve trampolining. That gives users great flexibility in formulating a tail-calling function.

Evaluating the tail-call function

The tail-calling function will return a TailRec object, but all that exists in heap memory at that point is an object “wired” for the trampolining mechanism. It won’t get evaluated until the class method result is called.

Constructed as an efficient tail-recursive function, method result evaluates the computation by matching the current TailRec[A] object against each of the subclasses and carrying out the corresponding programming logic, returning the resulting value of the type specified as the type parameter A.

If the current TailRec object is a Cont(a, f) which represents a transformation with function f on TailRec object a, the transformation will be carried out in accordance with what a is (thus another level of subclass matching). The class method flatMap comes in handy for carrying out the necessary composable transformation f as its signature conforms to that of the function taken by flatMap.

Trampolining Fibonacci

As a side note, Fibonacci generally will not incur stack overflow due to its relatively small space complexity, thus there is essentially no reason to apply trampolining. Nevertheless, it still serves as a good exercise of how to use TailCalls. On the other hand, a tail-recursive version of Fibonacci is highly efficient (see example in this previous post).

In case it isn’t obvious, the else case expression is to sum the by-definition values of F(n-2) and F(n-1) wrapped in tailcall(F(n-2)) and tailcall(F(n-1)), respectively, via flatMap and map:
如果不明显, else 大小写表达式是分别通过 flatMapmap 对包装在 tailcall(F(n-2))tailcall(F(n-1)) 中的 F(n-2)F(n-1 的按定义值求和:

which could also be achieved using for-comprehension:
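A sketch of the trampolined Fibonacci in both styles (function names are assumptions):

```scala
import scala.util.control.TailCalls._

// flatMap/map formulation: sum F(n-2) and F(n-1), each wrapped in tailcall
def fib(n: Int): TailRec[BigInt] =
  if (n < 2) done(BigInt(n))
  else tailcall(fib(n - 2)).flatMap(a => tailcall(fib(n - 1)).map(b => a + b))

// equivalent for-comprehension
def fibFor(n: Int): TailRec[BigInt] =
  if (n < 2) done(BigInt(n))
  else
    for {
      a <- tailcall(fibFor(n - 2))
      b <- tailcall(fibFor(n - 1))
    } yield a + b

fib(10).result  // 55
```

Note that both versions still have the exponential time complexity of the naive definition; the trampoline only changes where the pending work lives, not how much of it there is.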

Height of binary search tree

Let’s look at one more trampoline example that involves computing the height of a binary search tree. The following is derived from a barebone version of the binary search tree defined in a previous blog post:

The original height method is kept in trait BSTree as a reference. For the trampoline version heightTC, a helper function loop with an accumulator (i.e. ht) is employed to tally the tree-height level. Using flatMap and map (or equivalently a for-comprehension), the main recursive tracing of the tree height follows a tactic similar to the one used by the trampolining Fibonacci function.

Test running heightTC:
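A sketch using a barebone node type (no count field), as described; the deeply skewed test tree is built iteratively so only the height computation itself exercises the trampoline:

```scala
import scala.util.control.TailCalls._

sealed trait BSTree[+A] {
  // plain recursion: fine for balanced trees, risky for deeply skewed ones
  def height: Int = this match {
    case BSLeaf            => 0
    case BSBranch(_, l, r) => 1 + (l.height max r.height)
  }
  // trampolined version: accumulator ht tracks the level reached so far
  def heightTC: Int = {
    def loop(t: BSTree[A], ht: Int): TailRec[Int] = t match {
      case BSLeaf => done(ht)
      case BSBranch(_, l, r) =>
        for {
          lh <- tailcall(loop(l, ht + 1))
          rh <- tailcall(loop(r, ht + 1))
        } yield lh max rh
    }
    loop(this, 0).result
  }
}
case class BSBranch[+A](elem: A, left: BSTree[A] = BSLeaf,
    right: BSTree[A] = BSLeaf) extends BSTree[A]
case object BSLeaf extends BSTree[Nothing]

// a deeply left-skewed tree that would overflow the plain height method
val skewed = (1 to 100000).foldLeft(BSLeaf: BSTree[Int]) {
  (t, i) => BSBranch(i, left = t)
}
skewed.heightTC  // 100000, no stack overflow
```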

While at it, a tail-recursive version of the tree height method can be created with a slightly different approach. To achieve tail recursion, a recursive function is run with a Scala List of tree node-height tuples along with a max tree-height value as function parameters, as shown below.


NIO-based Reactor in Scala

For high concurrency at scale, event-driven server design with non-blocking I/O operations has been one of the most popular server architectures. Nginx and Node.js, both leading server platforms in their own spaces, adopt the very technology. Among the various event-driven server implementations, the Reactor pattern remains a prominent design pattern that leverages an event loop equipped with a demultiplexer to efficiently select events that are ready to be processed by a set of event handlers.

Back in 2013, I wrote a blog post about building a barebone server using Java NIO API to implement the Reactor pattern with non-blocking I/O in Java. The goal here is to rewrite the NIO-based Reactor server in Scala.

Java NIO and Reactor pattern

A quick recap of Java NIO, which consists of the following key components:

  • Buffer – a container of primitive typed data (e.g. Byte, Int) that can be optimized for native I/O operations with memory alignment and paging functionality
  • Channel – a connector associated with an I/O entity (e.g. files, sockets) that supports non-blocking I/O operations
  • Selector – a demultiplexer on an event loop that selects events which are ready for carrying out pre-registered I/O operations (e.g. read, write)

Note that NIO channels implement SelectableChannel, which can be registered with the Selector, yielding a SelectionKey, for any I/O operations of interest. To optimally handle high-volume client connections to the server, channels can be configured via the method configureBlocking(false) to support non-blocking I/O.

With NIO Buffers enabling optimal memory access and native I/O operations, Channels programmatically connecting I/O entities, and Selector serving as the demultiplexer on an event loop selecting ready-to-go I/O events to execute in a non-blocking fashion, the Java NIO API is a great fit for implementing an effective Reactor server.

Reactor event loop

This Scala version of the NIO Reactor server consists of two main classes NioReactor and Handler, along with a trait SelKeyAttm which is the base class for objects that are to be coupled with individual selection-keys as their attachments (more on this later).

Central to the NioReactor class is the “perpetual” event loop performed by the class method selectorLoop(). It’s a recursive function that never returns (thus returning Nothing), equivalent to the conventional infinite while(true){} loop. All it does is repeatedly check for the selection-keys whose corresponding channels are ready for the registered I/O operations and iterate through the keys to carry out the necessary work defined in the passed-in function iterFn().

Function iterateSelKeys, which is passed in as the parameter for the event loop function, holds the selection-keys iteration logic. While it’s tempting to convert the Java Iterator used in the original Java application to a Scala Iterator, the idea was scrapped due to the need for timely removal of the iterated selection-key elements via remove(), which is apparently a required step for the time-critical inner workings of the selector. Scala’s Iterator (or Iterable) has no such method or equivalent.

In contrast to the selection-key attachments being of type Runnable in the original version, they’re now subtypes of SelKeyAttm, each of which implements a method run() that gets called once selected by the Selector. With Scala Futures in use, Runnables are no longer the object type of the selection-key attachments. By making SelKeyAttm the base type for objects attached to the selection-keys, a slightly more specific “contract” (in the form of method specifications) is set up for those objects to adhere to.

Acceptor

The Acceptor, associated with the NIO ServerSocketChannel for the listener socket, is a subtype of SelKeyAttm. It’s responsible for reception of server connection requests.

Part of the NioReactor class’s constructor routine is to bind the ServerSocketChannel to a specified port number. It’s also where the ServerSocketChannel is configured to be non-blocking and registered with the selector for readiness to accept connections (OP_ACCEPT), subsequently creating a selection-key with the Acceptor instance as its attachment.
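That registration step can be sketched with the plain NIO API (a standalone snippet using an ephemeral port; the actual NioReactor wiring differs):

```scala
import java.net.InetSocketAddress
import java.nio.channels.{SelectionKey, Selector, ServerSocketChannel}

val selector      = Selector.open()
val serverChannel = ServerSocketChannel.open()

serverChannel.configureBlocking(false)        // non-blocking listener socket
serverChannel.bind(new InetSocketAddress(0))  // port 0: pick an ephemeral port

// register interest in OP_ACCEPT; the returned SelectionKey is where an
// Acceptor instance would be attached via key.attach(...)
val key = serverChannel.register(selector, SelectionKey.OP_ACCEPT)
// key.interestOps == SelectionKey.OP_ACCEPT
```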

The companion object of the NioReactor class is set up with a thread pool to run the Reactor server at a provided port number in a Scala Future.

Event handlers

As shown in the snippet of the Acceptor class, upon acceptance of a server connection, an instance of Handler is spawned. All events (in our case, the reading requests from and writing responses to client sockets) are processed by those handlers, which are another subtype of SelKeyAttm.
Acceptor 类的代码段所示,接受服务器连接后,将生成 Handler 的实例。所有事件(在我们的例子中,来自客户端套接字的读取请求和写入响应)都由这些处理程序处理,这些处理程序是 SelKeyAttm 的另一个子类型。

The Handler class instance takes a Selector and a SocketChannel as parameters, initializes a couple of ByteBuffers for read/write, configures the SocketChannel to be non-blocking, registers with the selector for the I/O operation OP_READ, creates a selection-key with the handler instance as its attachment, and finally nudges the selector for immediate return of any selected channels.

Method run() is responsible for, upon being called, carrying out the main read/write handling logic in accordance with the selection-key the passed-in SocketChannel is associated with and the corresponding I/O operation of interest.

Processing read/write buffers

Method read() calls channel.read(readBuf) which reads a preset number of bytes from the channel into the readBuf ByteBuffer and returns the number of Bytes read. If the channel has reached “end-of-stream”, in which case channel.read() will return -1, the corresponding selection-key will be cancelled and the channel will be closed; otherwise, processing work will commence.

Method process() does the actual post-read processing work. It’s supposed to do the heavy lifting (thus being wrapped in a Scala Future), although in this trivial server example all it does is echo whatever was read from the readBuf ByteBuffer into the writeBuf ByteBuffer using the NIO Buffer API, followed by switching the selection-key’s I/O operation of interest to OP_WRITE.
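The buffer shuffling in that echo step boils down to standard ByteBuffer mechanics, sketched here standalone (no channel involved, so the bytes are put in by hand):

```scala
import java.nio.ByteBuffer

val readBuf  = ByteBuffer.allocate(64)
val writeBuf = ByteBuffer.allocate(64)

// pretend channel.read(readBuf) deposited these bytes
readBuf.put("hello".getBytes("UTF-8"))

readBuf.flip()         // switch readBuf from filling mode to draining mode
writeBuf.put(readBuf)  // echo: copy everything read into the write buffer
writeBuf.flip()        // writeBuf is now ready for channel.write(writeBuf)

val echoed = new String(writeBuf.array, 0, writeBuf.limit, "UTF-8")
// echoed == "hello"
```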

Method write() calls channel.write(writeBuf) to write from the writeBuf ByteBuffer into the calling channel, followed by clearing both the read/write ByteBuffers and switching the selection-key’s I/O operation of interest back to OP_READ.

Final thoughts

In this code rewrite in Scala, the main changes include the replacement of:

  • Java Runnable with Scala Future, along with the base type SelKeyAttm for the Acceptor and Handler objects attached to selection-keys
  • while-loops with recursive functions
  • try-catch with Try-recover

While Java NIO is a great API for building efficient I/O-heavy applications, its underlying design clearly favors the imperative programming style. Rewriting the NIO-based Reactor server application in a functional language like Scala doesn’t necessarily make the code easier to read or maintain: many function calls in the API return void (i.e. Scala Unit) and mutate variables passed in as parameters, making a thorough idiomatic rewrite difficult.

Full source code of the Scala NIO Reactor server application is available at this GitHub repo.

To compile and run the Reactor server, git-clone the repo and run sbt from the project-root at a terminal on the server host:

Skipping the port number will bind the server to the default port 9090.

To connect to the Reactor server, use telnet from one or more client host terminals:

Any text input from the client host(s) will be echoed back by the Reactor server, which itself will also report what has been processed. Below are sample input/output from a couple of client host terminals and the server terminal:

As a side note, the output from method Handler.process() which is wrapped in a Scala Future will be reported if the server is being run from within an IDE like IntelliJ.
